INTRODUCTION:

A  substantial  number  of  breast  cancer  patients  initially  present  with  metastases,  or  are  at  substantial 
risk  of  relapse  after  surgery.  A  therapeutic  that  would  seek  out  and  destroy  these  metastases,  either  alone  or  in 
combination  with  other  therapies,  would  be  of  substantial  benefit. 

Salmonella enterica sv Typhimurium, a facultative anaerobic bacterium that infects both mice and humans,
naturally accumulates in a wide variety of solid tumors versus normal mouse tissue at a ratio of 1000:1 (1),
seemingly preferring the tumor environment over any other niche in the host. The bacterium has been used
successfully to selectively kill tumors (2-4) and to deliver proteins for cytotoxic or other therapeutic strategies
to tumor tissue in mice (5-16). We have shown that the Typhimurium A1 strain (leu, arg) effectively reduces the
growth of PC3 and breast tumor xenografts in nude mice while being virtually avirulent in this host organism
(5,6). Recently we observed cures of orthotopic human PC3 cancer metastases by Salmonella in mice (17).

In  the  first  year  of  this  project,  we  have  screened  mutations  in  all  non-essential  genes  in  Salmonella  to 
identify  mutants  that  are  unable  to  grow  well  in  normal  host  tissues,  and  are  therefore  harmless  to  humans, 
but  thrive  in  cancer  models  in  vivo.  In  addition,  we  have  screened  for  Salmonella  promoters  that  are 
preferentially  active  in  the  tumor  environment.  These  promoters  can  be  used  to  selectively  express  cloned 
therapeutic  proteins  in  tumors  and  export  them  outside  the  bacterium,  if  necessary,  while  minimizing  the  side 
effects of such therapeutics in the rest of the body (18-20). Improved growth specificity in tumors combined
with  expression  of  therapeutics  from  promoters  with  preferential  activity  in  tumor  tissues  may  result  in  a  very 
specific  and  inexpensive  vector  for  control  of  metastases. 

Although funding is organized by tissue of origin, it makes scientific sense not to confine our data
to a single tissue of origin. Demonstrating safety in any cancer type improves the prospects for attacking
breast cancer. Similarly, good performance of our Salmonella mutants in any cancer enhances the
chances of success in breast cancer. Thus, in experiments of direct relevance to the tasks and goals of the
current project, which is designed to test safety and efficacy in one tumor type, we continue to actively pursue
the properties of avirulent Salmonella in other tumor types, and in this project year we have published a series
of papers that bear on the project (21-25). In the first year of the project, two other manuscripts on the tasks
in this project are already in preparation.

The approaches we have taken in this project have generated data of a kind never previously
generated, or for which analysis tools had never been developed, or both. We therefore had to develop new
analysis tools, and this work is a vital and enduring product of the project. Furthermore, all of our tools are
available on the web, where they can be used by others funded by this mechanism and by other mechanisms
at DOD. In the current reporting period we submitted one paper on such a tool, which is widely applicable
beyond Salmonella in breast cancer (Xia et al., WebArrayDB: cross-platform microarray data analysis and
public data repository. Submitted. See appendix). We also submitted a methods paper on the use of this tool
(Wang et al., Analyzing microarray data using WebArray. Submitted. See appendix) and a paper on improving
oligonucleotide selection for arrays, which also benefits the tasks in this project. In the first year of the
project, two other manuscripts are already in preparation.

A  review  article  that  includes  some  of  the  strains  and  topics  in  this  project  was  published  in  the 
reporting  period  (24). 
BODY:

Aim 1. Task 1. Screen for fitness mutants in the tumor and normal tissue environment using a
library of transposon-tagged Salmonella (year 1).

In the reporting period a manuscript was planned for the data from this task; it will be submitted in
the next reporting period. The experiments involved 4T1 breast tumor lines, prostate tumor lines and
melanoma (MDA-MB-435) tumors, on the theory that the most successful Salmonella strains for therapy
would target diverse tumors.

Microarray  analysis  to  determine  fitness  in  normal  tissues  and  tumors.  A  library  of  40,000 
Salmonella  transposon  mutants  containing  mini-Tn5  transposon  insertions  was  injected  into  twelve  tumors 
growing  in  12  nude  mice.  Three  tumor-free  mice  were  injected  intravenously  with  the  same  Salmonella  library. 
Bacteria  were  recovered  after  two  days  from  tumors  and  from  the  spleens,  livers,  and  lungs  of  tumor-free 
mice. 

During in vivo selection, mutants in genes contributing to fitness in that selective environment are lost
from the library. Differences in the mutant library composition before (input library) and after selection
(output library) can be detected using microarray hybridization: the transposon sequence carries the T7
promoter sequence, allowing the specific amplification of genomic sequences adjacent to each insertion,
which are then mapped on the Salmonella genome using a gene microarray. This study revealed two distinct
classes of phenotypes.

Class 1 mutants. This class contains Salmonella mutants with reduced fitness in normal
tissues (spleen, liver, lung) and unchanged fitness in tumors. We identified mutants affecting at least 19 distinct
genes within the SPI-2 island (e.g., ssrA, ssaB, ssaC, ssaD, sseB, sscA, sseC, sseE, ssaJ, STM1410, ssaK, ssaL, ssaM,
ssaV, ssaN, ssaP, ssaQ, ssaR, ssaT). In addition, mutations in genes involved in a number of cellular functions
were identified. These include htrA, phoP, and sifA and a hypothetical operon containing a putative
acetyl-CoA hydrolase (STM3118), a putative monoamine oxidase (STM3119) and two putative lysR family
transcriptional regulators (STM3120, STM3121). Many of these mutations have previously been observed to
be associated with fitness in spleen (25,26). The observation of a similar effect on fitness in liver and lung is
new, though not unexpected. The fact that these mutants remain fit in tumors relative to other mutants is new
and of potential practical importance for Salmonella use as a direct therapy or for therapy delivery.

Class 2 mutants. This class contains mutants with reduced fitness both in normal tissues and in
tumor tissues. Three mutants of the same operon involved in the synthesis of aromatic compounds were
identified: aroM, aroD and aroA. Previous reports describe the use of Salmonella aroA and aroD mutants in
cancer therapy (27). Mutants of lipopolysaccharide genes belonging to the rfa and rfb clusters were also
identified in this class (e.g., rfbK, rfbM, rfbC, rfaQ). While class 2 mutants are either known to be avirulent or
likely to be of reduced virulence, their impaired growth in tumors relative to class 1 mutants may make them
less desirable as strains for delivery of therapy.
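
The two phenotype classes described above can be illustrated with a minimal sketch. It assumes each mutant has one hybridization signal for the input library and one for each output library, and that "reduced fitness" means a drop of at least two-fold (log2 ratio of -1 or lower). The cutoff, the function names and the example signals are illustrative assumptions, not the analysis parameters or data of this project.

```python
import math

# Assumed cutoff (not from the report): a mutant counts as "reduced fitness"
# when its output/input signal ratio is one half or less (log2 ratio <= -1).
LOG2_CUTOFF = -1.0


def log2_ratio(output_signal, input_signal):
    """Log2 of the output-library signal over the input-library signal."""
    return math.log2(output_signal / input_signal)


def classify_mutant(normal_ratios, tumor_ratio, cutoff=LOG2_CUTOFF):
    """Assign a phenotype class from log2 ratios.

    normal_ratios: log2 ratios in normal tissues (e.g. spleen, liver, lung).
    tumor_ratio:   log2 ratio in tumor.
    """
    reduced_in_normal = all(r <= cutoff for r in normal_ratios)
    reduced_in_tumor = tumor_ratio <= cutoff
    if reduced_in_normal and not reduced_in_tumor:
        return "class 1"  # lost from normal tissues, retained in tumors
    if reduced_in_normal and reduced_in_tumor:
        return "class 2"  # lost from normal tissues and from tumors
    return "unclassified"


# Hypothetical signals (input, spleen, liver, lung, tumor); not measured data.
example_signals = {
    "mutant_A": (1000.0, 120.0, 200.0, 250.0, 950.0),
    "mutant_B": (1000.0, 100.0, 150.0, 180.0, 200.0),
    "mutant_C": (1000.0, 900.0, 1100.0, 1000.0, 1050.0),
}


def classify_all(signals):
    """Classify every mutant in a {name: signals} dictionary."""
    results = {}
    for name, (inp, spleen, liver, lung, tumor) in signals.items():
        normal = [log2_ratio(x, inp) for x in (spleen, liver, lung)]
        results[name] = classify_mutant(normal, log2_ratio(tumor, inp))
    return results


if __name__ == "__main__":
    for name, phenotype in classify_all(example_signals).items():
        print(name, phenotype)
```

Requiring reduced fitness in all three normal tissues is one possible reading of the class definitions; a less strict rule (for example, any one tissue) would only change the `all` test.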

Task  2  of  this  aim  in  year  2  will  take  some  of  these  mutants  and  test  them  in  tumors. 

Tumor targeting of STM3120 using syngeneic orthotopic 4T1 breast tumors. The tumor-targeting
capability of the STM3120 knockout mutant was tested after intragastric delivery to BALB/c mice
bearing 4T1 murine breast tumors growing orthotopically in the mammary fat pads. Five six-week-old
BALB/c mice bearing 4T1 tumors were each given 7 x 10^8 cfu of STM3120 orally. Tumor biopsies
were taken 2, 5, 7 and 9 days later using Gallini Medical Devices needles, and bacterial counts were determined.
Bacteria were detected in three mice 7 days after administration. At day 9, bacterial counts ranged from 2 x 10^4
to 9 x 10^5 cfu per biopsy in all 5 mice.
These results (summarized in Table 1) suggest that intragastric delivery of STM3120 allows a
sufficient number of bacteria (~10^7 to 5 x 10^8 cfu per tumor) to target and multiply in the tumor environment
to levels that have previously been shown to effectively reduce tumor size after intratumoral or intravenous
injections (30). This is of importance because intragastric delivery of a therapeutic Salmonella strain offers
increased convenience over intravenous delivery. A similar finding was recently made by Jia and coworkers (31),
who showed a significant anticancer effect of orally administered VNP20009 in C57BL/6 mice bearing syngeneic
subcutaneous B16F10 melanoma and Lewis lung carcinoma.

Class  1  mutants  that  retain  tumor-targeting  while  being  poor  colonizers  of  normal  tissue  seem  best 
suited  for  delivery  of  cancer  therapeutics.  However,  mutants  will  need  to  be  tested  in  the  intended  host, 
whether  it  be  humans  or  companion  animals  with  cancer,  before  the  best  candidates  for  the  host  can  be 
determined.  We  have  shown  that  high-throughput  transposon  library  screening  allows  the  identification  of 
novel  Salmonella  mutations  of  potential  therapeutic  value,  and  also  allows  the  re-evaluation  of  Salmonella 
mutants  previously  used  in  cancer  therapy.  Such  approaches  can  be  adapted  to  any  host  and  tumor  model  and 
a  wide  variety  of  bacterial  species. 


Table  1.  Growth  of  STM3120  mutant  in  orthotopic  4T1  breast 
tumors  after  intragastric  delivery.  Numbers  represent  cfu  per 
biopsy  per  mouse  taken  at  different  days  following  injection.  Dash 
indicates  a  level  of  bacteria  below  detection. 

Days     Mouse 1     Mouse 2     Mouse 3     Mouse 4     Mouse 5
2        -           -           -           -           -
5        -           -           -           -           -
7        -           2.00E+05    1.25E+02    4.00E+05    -
9        3.00E+04    7.50E+05    5.00E+04    9.00E+05    2.00E+04

Aim  1.  Task  2.  To  test  individual  mutants  for  avirulence  and  tumor-selective  properties  (year  2  and 
3). 


In mid-2009 plans were well under way to screen some of the above mutants in tumors. In addition,
plans were in place to screen over 1000 individual mutants for avirulence in mice. The results of these
experiments will be reported in the year two annual report.
Aim 2. Task 1. To identify DNA sequences that act as promoters in tumors but not in normal tissue (year 1).

This  step  of  the  project  was  partly  published  (32).  Work 
building  on  this  advance  has  been  filed  as  a  patent  (attached  as  an 
appendix). 

Screening of in-vivo tumor-activated promoters. GFP-promoter libraries constructed in a vector
that we created (Figure 1) were mixed and injected intratumorally (IT) into four nude mice bearing human
tumors. After two days, tumors were combined, homogenized and analyzed by FACS. GFP-positive cells
were recovered and expanded overnight in LB containing ampicillin. To eliminate clones harboring
constitutive promoters, the tumor library was subjected to a negative FACS sort after overnight growth in LB
and a subsequent second positive FACS sort 2 days after a second passage in tumors. We have optimized the
FACS analysis to discriminate between true green cells and other fluorescent particles by measuring the ratio
of fluorescence to auto-fluorescence and plotting it against side scatter on the X-axis. Figure 2 shows the
FACS analysis of a sub-library after 2 passages in tumors.


Genome-wide survey of tumor-activated promoters using NimbleGen arrays. Plasmid DNA
was extracted from the original promoter library (Library-0), from a sub-library of clones activated in spleen,
and from the sub-libraries of clones activated in subcutaneous PC3 tumors in nude mice after two passages in
tumors. Promoter sequences were recovered by PCR, labeled with Cy5 (Library-0) or Cy3 (spleen or
tumor library), and hybridized to an array of 387,000 oligonucleotide sequences spaced at 12-base
intervals around the Typhimurium genome (NimbleGen). Using a threshold of a two-fold increase in
hybridization signal relative to the control (Library-0), 86 intergenic regions were enriched in tumor but not
in the spleen, and 174 intergenic regions were enriched in both tumor and spleen (data not shown).
Twenty-two intergenic regions are already cloned (see Table 2).
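
The two-fold threshold applied to the array data can be sketched as follows. This is a minimal illustration, assuming one normalized signal per intergenic region for Library-0, the spleen sub-library and the tumor sub-library. The region names and signal values are hypothetical, and treating a ratio of exactly 2 as enriched is an assumption.

```python
FOLD_THRESHOLD = 2.0  # two-fold threshold relative to Library-0, as stated in the text


def is_enriched(sample_signal, library0_signal, threshold=FOLD_THRESHOLD):
    """True when the sample signal is at least `threshold` times the Library-0 signal."""
    return sample_signal / library0_signal >= threshold


def partition_regions(regions):
    """Split intergenic regions into tumor-only and tumor-plus-spleen sets.

    regions: dict mapping region name -> (library0, spleen, tumor) signals.
    Returns two lists of region names: (tumor_only, both).
    """
    tumor_only, both = [], []
    for name, (library0, spleen, tumor) in regions.items():
        in_tumor = is_enriched(tumor, library0)
        in_spleen = is_enriched(spleen, library0)
        if in_tumor and in_spleen:
            both.append(name)
        elif in_tumor:
            tumor_only.append(name)
    return tumor_only, both


# Hypothetical signals (Library-0, spleen, tumor); not data from this project.
example_regions = {
    "IR_a": (100.0, 110.0, 450.0),  # enriched in tumor only
    "IR_b": (100.0, 300.0, 500.0),  # enriched in tumor and spleen
    "IR_c": (100.0, 90.0, 120.0),   # not enriched
}

if __name__ == "__main__":
    tumor_only, both = partition_regions(example_regions)
    print("tumor only:", tumor_only)
    print("tumor and spleen:", both)
```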


Figure 1. Promoter library construction. A 300-550 bp size-class library of Salmonella genome fragments
was cloned upstream of a promoter-less TurboGFP and flanked by transcriptional terminators. TurboGFP is the
fastest folding GFP known (a few minutes) and the brightest GFP known. The protein was destabilized by adding
a protease-targeting signal (AANDENYALAA) to make its half-life less than one hour.

Figure 2. Identification of fluorescent bacteria by FACS. Sorting scheme: Tumor (+) -> LB (-) -> Tumor (+).
Axes: SSC-W (x 1,000) and GFP-A.

Figure 3. Promoter activation after intra-tumor injection. Panel labels: injected tumor, day 2.
Table 2: Intergenic regions that induce higher GFP expression in tumor than in spleen

[Heat map: median of experiment versus input library for lib1, lib2, lib3 and lib4; values not reproduced.]

Sequenced clones: 86, 10, 11, 26, 28, 36, 44, 45, 61, 78

Intergenic region (IR)       Gene label
IR STM0468 - STM0469         ylaB
IR STM0474 - STM0475         ybaJ
IR STM0580 - STM0581         STM0580
IR STM0844 - STM0845         pflE
IR STM0937 - STM0938         hcp
IR STM1382 - STM1383         orf408
IR STM1529 - STM1530         STM1529
IR STM1807 - STM1808         dsbB
IR STM1914 - STM1915         flhB
IR STM1996 - STM1997         cspB
IR STM2035 - STM2036         cbiA
IR STM2261 - STM2262         napF
IR STM2309 - STM2310         menD
IR STM3070 - STM3071         epd
IR STM3106 - STM3107         ansB
IR STM3525 - STM3526         glpE
IR STM3880 - STM3881         kup
IR STM4289 - STM4290         phnA
IR STM4418 - STM4419         STM4418
IR STM4430 - STM4431         STM4430

Flanking-gene labels also shown: rpmE2, acrB, STM0581, moeB, ybjE
Sequencing of promoters. A total of 192 clones from the tumor library were picked at random and
sequenced, yielding 100 different sequences. These sequences were mapped to the genome and their
potential regulation (tumor-specific or active in both spleen and tumor) was determined by comparison with the
microarray data. We found 22 candidate promoters preferentially activated in tumors and 40 candidate
constitutive promoters. The tumor-specific clones recovered in this experiment represent 23% of the total 95
tumor-expressed intergenic regions detected on microarrays. Table 2 includes cloned promoter fragments that
showed differential activity in the array assay.


Confirmation of tumor specificity of individual clones in vivo. Twenty-two tumor-specific candidates
were recovered; of these, three were individually confirmed in vivo. The clones were intravenously injected at
5 x 10^6, 1 x 10^7 and 5 x 10^[?] cfu into tumor-free and tumor-bearing nude mice. One or two days
post-injection, spleens and tumors were imaged using the OV100 and homogenized, and the bacterial titer was
quantified on LB + Amp plates. Spleens from normal mice were compared with tumors that had similar
bacterial counts, so that any difference in fluorescence would be attributable to increased GFP expression rather
than bacterial numbers. Figure 4 presents images indicating that, for each of the three clones, the tumors are
much more fluorescent than spleens infected with the same number of bacteria. In contrast to these putative
tumor-specific clones, a positive control that constitutively expresses TurboGFP resulted in strong
fluorescence in spleen even with doses as low as 2 x 10^5 cfu. An example is shown in Figure 4 for promoter
clone C19.
Figure 4. GFP-based promoter expression in tumors and normal tissues in nude mice using the whole-mouse
OV100 imaging system. Promoter clone C19 is expressed in tumors (GFP positive) but not in spleens (GFP
negative); a constitutive GFP promoter, pturbo (control), is activated in both tissues.

[Image panels: intact tumor, tumor, spleen and liver, for C19 and control.]


Regulatory pathways for promoters preferentially induced in tumors. Promoters regulated by
anaerobiosis are likely to be induced in the hypoxic regions of solid tumors, and most of them are under control
of the Salmonella global regulators Fnr and ArcA (/). There are at least 22 candidate promoters of this class
among the 95 tumor-specific intergenic regions identified on arrays (data not shown); two of the anaerobically
induced promoters are shown in Figure 4. Clone 10 is the promoter region of a putative pyruvate formate lyase
activating enzyme (pflE), and the promoter region of pflE contains an Fnr-regulated sequence. In E. coli, the
anaerobic transcription of the next gene (pflF) is co-regulated by two major global regulators of anaerobic
metabolism, ArcA and Fnr (/). Clone 45 contains the promoter region of ansB, which encodes part of
asparaginase, a tetrameric enzyme that catalyzes the hydrolysis of asparagine to aspartic acid and ammonia. In
E. coli, ansB is positively co-regulated by CRP (cyclic AMP receptor protein) and the Fnr protein (/). However,
in Salmonella enterica the anaerobic regulation of the ansB gene may require only CRP (/). Clone 28 contains the
promoter region of flhB, a gene that is required for the formation of the rod structure of the flagellar apparatus
(/). This candidate promoter and many others identified on arrays are not known to be induced by hypoxia.
Some of these promoters may be induced by a different signal present in subcutaneous tumors.

Transition  to  the  second  year.  In  aim  2,  task  2,  below,  we  will  discuss  further  improvements  in  the 
approach  and  the  development  of  tools  for  those  approaches.  We  also  want  to  repeat  experiments  in 
orthotopic  models. 


Aim 2. Task 2. To identify promoters that respond to anoxia and/or acid pH, or neither (year 1).


We subjected our Salmonella strain to anoxia and to various pH values and grew the cultures to stationary phase.
This  was  done  in  triplicate.  Then  the  samples  were  subjected  to  FACS  sorting.  The  resulting  material  was 
applied  to  Nimblegen  arrays  (tiling  arrays  of  387,000  overlapping  oligos  in  both  strands  of  the  Salmonella 
genome).  No  tools  existed  for  analyzing  this  kind  of  data  and  indeed,  no  tools  other  than  ours  have  been 
developed  since  that  time.  So  we  embarked  on  developing  those  tools.  At  the  end  of  the  first  year,  these  tools 
were  still  in  raw  form.  The  accompanying  figure  shows  the  power  of  the  data  using  one  type  of  presentation 
from  the  first  tool  developed.  Tabular  calculations  identified  tens  of  differentially  induced  promoters  which 
will  be  the  topic  of  the  next  annual  report.  At  least  ten  promoters  with  responsiveness  to  pH  and  anoxia  will 
be  under  investigation  for  year  2.  Of  special  interest  are  those  promoters  that  were  found  in  surveys  of  tumor 
versus  spleen  (Task  1),  and  particularly  if  these  are  under  different  regulation  in  vitro.  Combinations  of  such 
promoters  could  lead  to  tight  regulation  of  therapeutics  only  in  anoxia  +  low  pH,  for  example. 
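
The idea of combining promoters that respond to more than one signal can be pictured as simple set operations over the lists of induced promoters. The sketch below is only an illustration under stated assumptions: the promoter names are placeholders, and membership in each set is assumed to come from a separate analysis of the array data.

```python
def select_candidates(anoxia_induced, low_ph_induced, tumor_specific):
    """Find promoters induced by both anoxia and low pH, and the subset of
    those that were also tumor-specific in the tumor-versus-spleen survey."""
    both_signals = anoxia_induced & low_ph_induced
    return {
        "anoxia_and_low_ph": both_signals,
        "anoxia_and_low_ph_and_tumor": both_signals & tumor_specific,
    }


# Placeholder promoter names; not results from this project.
anoxia = {"promoter_1", "promoter_2", "promoter_3"}
low_ph = {"promoter_2", "promoter_3", "promoter_4"}
tumor = {"promoter_3", "promoter_5"}

if __name__ == "__main__":
    selected = select_candidates(anoxia, low_ph, tumor)
    for label, names in selected.items():
        print(label, sorted(names))
```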
Figure 5 shows some of our visualization efforts. Note that promoter data have never before been
presented at this resolution, this comprehensively, and in this user-friendly form. In future reporting periods we
hope to convert animal promoter experiments to the same analysis pipeline.


[Figure 5 panels: Luria broth, pH 7.5; Luria broth, pH 5.5; X broth, aerobic; X broth, anaerobic. Y axis: ratio.]

Figure  5.  A  comprehensive  survey  indicates  differentially  activated  promoters.  This  example  is  a  promoter  that 
differs  between  growth  conditions.  In  this  figure  0.2%  of  the  entire  genome  is  presented  in  the  X  axis.  Blue  indicates  genes  in  the 
sense  strand  and  red  in  the  antisense  strand.  Gene  starts  are  upright  and  inverted  triangles.  The  strand  of  the  captured  promoter 
is  also  presented  in  the  same  colors.  The  Y  axis  represents  the  log2  of  the  ratio  of  the  input  library  to  the  FACS  sorted  library.  The 
region indicated by a box shows a promoter that is four-fold to eight-fold more active in Luria broth than in X-broth (a medium often
used for anaerobic growth).
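
Because the Y axis of Figure 5 is on a log2 scale, a difference between two conditions on that axis converts directly to a fold difference: a gap of 2 units corresponds to four-fold and a gap of 3 units to eight-fold. The short sketch below shows the conversion; the function name and the example values are illustrative assumptions.

```python
def fold_difference(log2_value_a, log2_value_b):
    """Fold difference implied by two values read from a log2-scaled axis."""
    return 2.0 ** abs(log2_value_a - log2_value_b)


if __name__ == "__main__":
    # Hypothetical axis readings for one promoter under two growth conditions.
    print(fold_difference(-1.0, -3.0))  # gap of 2 log2 units
    print(fold_difference(0.5, 3.5))    # gap of 3 log2 units
```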

Development of new tools for the accomplishment of the tasks in this project. The analysis of
the data in Aim 2, tasks 1 and 2 required a considerable amount of bioinformatics development. Mapping of
promoters to microarrays is a non-trivial task. The work for the figure above was not ready for publication as
a tool, but many intermediate steps were ready. This work is described in the attached manuscripts: Xia et al.,
WebArrayDB: cross-platform microarray data analysis and public data repository (submitted; see appendix),
and Wang et al., Analyzing microarray data using WebArray (submitted; see appendix). In brief, an open
source integrated microarray database and analysis suite, WebArrayDB (http://www.webarraydb.org), was
developed. It features convenient uploading of data for storage in a MIAME (Minimum Information About a
Microarray Experiment) compliant fashion, and allows data to be mined with a large variety of R-based tools,
including data analysis across multiple platforms. Different methods for probe alignment, normalization and
statistical analysis were included to account for systematic bias. Student's t-test, moderated t-tests,
non-parametric tests and analysis of variance or covariance (ANOVA/ANCOVA) are among the choices of
algorithms for differential analysis of data. Users also have the flexibility to define new factors and create new
analysis models to fit complex experimental designs. All data can be queried or browsed through a web
browser. The computations can be performed in parallel on symmetric multiprocessing (SMP) systems or
Linux clusters. The software package is available for use on a public web server
(http://www.webarraydb.org) or can be downloaded at Bioinformatics online.
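
As an illustration of the simplest of the differential-analysis options listed above, the following sketch computes a two-sample Student's t statistic with pooled variance for one probe measured in two conditions. It is not WebArrayDB code (which relies on R-based tools); it uses only the Python standard library, and the function name and example values are assumptions made for the illustration.

```python
import math
import statistics


def student_t(sample_a, sample_b):
    """Two-sample Student's t statistic (equal-variance form).

    Returns the t statistic and the degrees of freedom.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # statistics.variance returns the sample variance (n - 1 denominator).
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    degrees_of_freedom = n_a + n_b - 2
    pooled_variance = ((n_a - 1) * var_a + (n_b - 1) * var_b) / degrees_of_freedom
    standard_error = math.sqrt(pooled_variance * (1.0 / n_a + 1.0 / n_b))
    return (mean_a - mean_b) / standard_error, degrees_of_freedom


if __name__ == "__main__":
    # Hypothetical log2 signals for one probe in triplicate under two conditions.
    condition_1 = [1.0, 2.0, 3.0]
    condition_2 = [4.0, 5.0, 6.0]
    t_value, df = student_t(condition_1, condition_2)
    print(round(t_value, 4), df)
```

A p-value would additionally require the t distribution, which the standard library does not provide; in practice a statistics package would supply it.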

We have spent considerable effort to improve oligo selection for arrays (Xia et al., Evaluating
oligonucleotide properties for DNA microarray probe design. Submitted, see Appendix). In brief, most
current microarray oligonucleotide probe design strategies are based on probe design factors (PDFs), which
include probe hybridization free energy (PHFE), probe minimum folding energy (PMFE), dimer score,
hairpin score, homology score, and complexity score. The impact of these PDFs on probe performance was
evaluated using four sets of microarray comparative genome hybridization (aCGH) data, which included two
array manufacturing methods and the genomes of two species: Salmonella and humans. We developed a new
probe  design  factor,  pseudo  probe  binding  energy  (PPBE),  by  iteratively  fitting  di-nucleotide  positional 
weights  and  di-nucleotide  stacking  energies  until  the  average  residue  sum  of  squares  (ARSS)  for  the  model 
was  minimized.  PPBE  showed  a  better  correlation  with  probe  sensitivity  and  a  better  specificity  than  all  other 
PDFs.  The  physical  properties  that  are  measured  by  PPBE  are  as  yet  unknown  but  include  a  platform- 
dependent  component.  Programs  and  correlation  parameters  from  this  study  are  freely  available  to  facilitate 
the  design  of  DNA  microarray  oligonucleotide  probes. 

Using  these  tools  we  are  able  to  generate,  store,  analyze,  and  present  data  in  an  expeditious,  useful, 
and  attractive  manner. 

Aim  2.  Task  3.  To  test  individual  candidate  promoters  for  differential  activity  in  tumors  (year  2). 

In  mid  2009,  plans  were  in  place  to  screen  candidate  promoters  and  this  task  for  year  two  will  be 
reported  in  the  year  two  annual  report. 

Aim  3.  To  combine  the  best  mutant  strains  with  the  best  tumor-specific  promoters  (year  2). 

In  mid  2009,  plans  were  in  place  to  screen  candidate  promoters  and  this  task  for  year  two  will  be 
reported  in  the  year  two  annual  report. 

Aim  4.  To  test  one  potential  therapeutic  delivery  system  as  a  proof  of  principle  (as  time  permits). 

Although  this  step  was  beyond  the  scope  of  the  proposal  and  was  not  a  formal  task,  we  have  signed 
MTAs and acquired the vectors needed for this step. It is not likely we will perform this optional task until
year  3,  at  the  earliest. 

KEY  RESEARCH  ACCOMPLISHMENTS: 

•  Identification  of  a  few  candidate  genes  that  when  mutated  alter  the  targeting  of  Salmonella  to  cancers. 

•  Identification  of  over  50  candidate  Salmonella  genes  that  when  mutated  allow  growth  in  tumors  while 
debilitating  virulent  growth  in  the  spleen. 

•  Identification  of  over  50  candidate  Salmonella  promoter  regions  that  are  preferentially  activated  in 
tumors  but  not  in  the  spleen. 

•  Identification  of  tens  of  promoters  in  vitro  responsive  to  anoxia  and  pH  that  potentially  correlate 
with  conditions  in  the  tumor. 
•  Improvements in data analysis software with the continued updating of www.webarrayDB.org, a public
resource that we maintain so that it can handle new kinds of data.

REPORTABLE  OUTCOMES: 


Abstracts  presented: 

Nabil  Arrach,  Ming  Zhao,  Steffen  Porwollik,  Robert  M.  Hoffman  and  Michael  McClelland  (2008) 
Salmonella  promoters  preferentially  activated  inside  tumors.  Annual  meeting  of  the  American  Association  for 
Cancer  Research,  San  Diego,  California,  USA 

Nabil  Arrach,  Ming  Zhao,  Robert  M.  Hoffman  and  Michael  McClelland  (2009)  Microarray  screening 
of  Salmonella  variants  for  tumor  targeting.  Annual  meeting  of  the  American  Association  for  Cancer  Research, 
Denver,  Colorado,  USA 

Submitted: 

Xia  XQ,  Jia  Z,  Porwollik  S,  Long  F,  Hoemme  C,  Ye  K,  Muller-Tidow  C,  McClelland  M, 
Wang  Y.  Evaluating  oligonucleotide  properties  for  DNA  microarray  probe  design. 

Xia XQ, McClelland M, Porwollik S, Song W, Cong X, Wang Y. WebArrayDB: cross-platform
microarray data analysis and public data repository.

Wang  Y,  McClelland  M,  Xia  XQ.  Analyzing  Microarray  Data  Using  WebArray. 

Planned  or  in  preparation: 

Santiviago  CA,  Reynolds  MM,  Porwollik  S,  Choi  SH,  Long  F,  Andrews-Polymenis  HL, 
McClelland  M.  Analysis  of  pools  of  targeted  Salmonella  deletion  mutants  identifies  novel  genes 
affecting  fitness  during  competitive  infection  in  mice. 

Wang Y, Xia XQ, Jia Z, Sawyers A, Yao H, Wang-Rodriguez J, Mercola D,
McClelland  M.  In  silico  estimates  of  tissue  components  in  surgical  samples  based  on  expression 
profiling  data. 

Xia  XQ,  McClelland  M,  Wang  Y.  TabSQL:  a  MySQL  tool  to  facilitate  mapping  user  data  to 
public  databases. 

Arrach  N,  Cheng  P,  Zhao  M,  Santiviago  CA,  Hoffman  RM,  McClelland  M.  High-throughput 
screening for Salmonella avirulent mutants that retain targeting of solid tumors.

Patent  applications  filed: 

PCT  VTV-1001-PC,  Methods  to  treat  solid  tumors  (see  appendix). 

Informatics  and  databases: 

Improvements  to  www.webarrayDB.org  and  to  oligo  selection  methods  for  arrays. 
Improvements to databasing; increased capacity and ease of use.

CONCLUSION: 


In Aim 1, task 1, the investigators have identified over 50 genes whose mutation renders Salmonella
less virulent for infection while the bacteria still retain the ability to target tumors and grow in
tumors. The ability to target tumors after oral delivery was also demonstrated. This sets the stage for Aim 1,
Task 2, in years 2 and 3, in which we will test individual mutants for avirulence and tumor-selective properties.

In Aim 2, task 1, over 50 candidate promoters were identified that were induced in tumor, and may
be less induced in other parts of the animal host. In Aim 2, task 2, all the experiments on anoxia and pH were
completed.  A  few  genes  induced  preferentially  in  these  conditions  are  under  investigation.  Tools  were 
developed  to  present  this  data  and  have  been  made  publicly  available.  This  sets  the  stage  for  Aim  2,  Task  3,  in 
years  2  and  3,  in  which  we  will  test  individual  candidate  promoters  for  differential  activity  in  tumors. 
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Abstract 

Salmonella typhimurium has the ability to target a wide range of solid tumors and
accumulates in tumors at levels thousands of fold higher than in normal tissues. Only a
handful  of  attenuated  Salmonella  strains  are  currently  being  investigated  for 
cytokine  delivery  or  gene  directed  enzyme  pro-drug  therapy.  There 
remains  considerable  scope  for  engineering  low  toxicity  and  improved  targeting  to 
tumors  in  humans.  A  high  throughput  screening  of  a  complex  Salmonella  mutant 
library  was  performed  in  human  prostate  tumors,  melanomas,  and  normal  tissues  in 
nude  mice.  Microarrays  were  used  to  identify  Salmonella  variants  that  have 
reduced  fitness  in  normal  tissues  (for  safety)  but  still  thrive  in  tumors  (unchanged 
fitness or even increased fitness). Our data reveal that some Salmonella mutants
previously used for cancer therapy, such as aroA and aroD, are very safe but at a
disadvantage for growth in tumors. Screening for optimized safe strains can be applied
to  multiple  animal  models  to  ensure  the  generality  of  the  findings,  potentially 
improving  safety  and  targeting  of  cancers  in  humans. 
Salmonella promoters preferentially activated inside tumors

Salmonella has the ability to preferentially grow in the hypoxic environment of solid
tumors and has previously been used to express therapeutic proteins. We have recently
developed a strain of S. typhimurium which preferentially targets viable tumor tissue
as well as necrotic tissue (Proc. Natl. Acad. Sci. USA 104, 10170-10174, 2007).
However, bacteria still circulate at low levels in the body. Control of protein
expression using endogenous Salmonella promoters that are preferentially activated in
tumors could further improve targeting of therapies. A random library of Salmonella
enterica Typhimurium 14028 genomic DNA was cloned upstream of a promoter-less
green fluorescent protein gene (TurboGFP), and bacteria carrying the library were
injected intravenously into tumor-free mice and into nude mice bearing subcutaneous
human PC3 prostate tumors.

After two days, fluorescence-activated cell sorting was used to enrich for bacterial
clones expressing GFP in spleens or in tumors. The resulting libraries were hybridized
to an oligonucleotide tiling array of the Salmonella genome. 95 intergenic regions were
enriched in tumor samples but not in spleen. Sequencing of 100 clones from a
tumor-enriched library yielded 22 from intergenic regions that showed significant enrichment
in tumors versus spleen in the microarrays. Three of these 22 candidate promoter
clones were tested in vivo and enhanced GFP expression in tumor relative to spleen
was confirmed. Two of the three clones mapped to the pflE and ansB promoter
regions, which are known to undergo induction in the hypoxic conditions that occur in
solid tumors. Most of the other 93 candidates are not known to be regulated by
hypoxia and some may reveal other properties of tumors exploited by Salmonella. The
expression of therapeutics in Salmonella under the regulation of one or more
promoters that are activated preferentially in tumors has the potential for
tumor-targeted therapy with reduced side-effects.
Abstract

Cross-platform microarray analysis is an increasingly important research tool, but researchers
still lack open source tools for storing, integrating, and analyzing large amounts of microarray data
obtained from different array platforms. An open source integrated microarray database and analysis
suite, WebArrayDB (http://www.webarraydb.org), has been developed that features convenient
uploading  of  data  for  storage  in  a  MIAME  (Minimal  Information  about  a  Microarray  Experiment) 
compliant  fashion,  and  allows  data  to  be  mined  with  a  large  variety  of  R-based  tools,  including 
data  analysis  across  multiple  platforms.  Different  methods  for  probe  alignment,  normalization  and 
statistical  analysis  are  included  to  account  for  systematic  bias.  Student’s  t-test,  moderated  t-tests, 
non-parametric  tests,  and  analysis  of  variance  or  covariance  (ANOVA/ANCOVA)  are  among  the 
choices  of  algorithms  for  differential  analysis  of  data.  Users  also  have  the  flexibility  to  define  new 
factors  and  create  new  analysis  models  to  fit  complex  experimental  designs.  All  data  can  be  queried 
or  browsed  through  a  web  browser.  The  computations  can  be  performed  in  parallel  on  symmetric 
multiprocessing  (SMP)  systems  or  Linux  clusters. 

[WebArrayDB  is  freely  available  at  http://www.webarraydb.org.] 


Introduction 

Large amounts of microarray experimental data are
stored in public repositories, making cross-platform
analysis of data from different sources (either different
laboratories and/or different platforms) an
increasingly attractive and important research tool
[Moreau et al., 2003]. Such analyses are possible because
biological treatments usually have a greater
impact on measured expression than the noise of a
cross-platform analysis [Chen et al., 2008, Larkin
et al., 2005, Shippy et al., 2004]. Moreover, the
combined use of multiple platforms can overcome
the inherent biases of individual platforms for identification
of the more robust changes in gene expression
profiles [Bosotti et al., 2007].

Currently available analysis packages do not
provide all the required functions for cross-platform
integration, normalization, and statistical analysis
of data from different sources. Integrative Array
Analyzer (iArray) [Pan et al., 2006] offers statistical
cross-platform analysis functions but does
not have probe alignment or data normalization
features. MatchMiner [Bussey et al., 2003] is a
powerful tool for matching genes and gene products
from two platforms but is not designed for
statistical analysis. The Gene Expression Pattern
Analysis Suite (GEPAS) [Tarraga et al., 2008]
integrates many tools for microarray data analysis,
but it does not have data storage capability
or cross-platform analysis functions. Other online
platforms and public repositories are designed
mainly for data storage and lack probe matching
and cross-platform analysis functions: prominent
examples include Expression Profiler [Kapushesky
et al., 2004], ArrayExpress [Parkinson et al., 2007],
the Stanford Microarray Database (SMD) [Demeter
et al., 2007], the Longhorn Array Database (LAD)
[Killion et al., 2003] and the BioArray Software
Environment (BASE) [Saal et al., 2002, Troein et al.,
2006].

An earlier open-source online platform for microarray
data analysis, WebArray [Xia et al., 2005],
did not offer a cross-platform analysis function,
but provided an excellent framework for extension
to WebArrayDB (http://www.webarraydb.org), a
database system and analysis suite that provides
this function. In addition to traditional methods
such as median and quantile for between-array
normalization, WebArrayDB has integrated median
rank scores (MRS), quantile discretization (QD)
[Warnat et al., 2005], gene quantile (GQ; a quantile
normalization for each individual gene among
different platforms), and principal component analysis
(PCA) [Stoyanova et al., 2004]. WebArrayDB
provides standard statistical analysis methods,
such as Student’s t-test, eBayes-moderated t-test,
Significance Analysis of Microarrays (SAM)
[Tusher et al., 2001], ANOVA/ANCOVA and
non-parametric tests, as options for users to explore.

Database Infrastructure

WebArrayDB includes all fields required for
MIAME-compliant microarray data storage
[Brazma et al., 2001]. Data are classified into five
categories: “project”, “array”, “platform”, “protocol”
and “sample”. Each record in these tables
is given a unique ID (“MPMDB ID”), and all five
categories have to be filled for MIAME compliance
and subsequent data analysis. All tables in the
database have been indexed to speed up queries
even when the size of the data set becomes very
large.

The project table serves as the hub of information:
most information is linked to a specific
project in the database (Figure 1 and Supplementary
Figure 1). Intrinsic relationships
among project, array, platform, protocol, and sample
are directly linked by references between tables,
which permits fast cross-table searching. When
defining a platform, users may supply probe information,
including user-defined IDs and gene IDs
from other public databases, such as RefSeq, UniGene,
etc. All of these IDs can serve as references
for cross-platform probe alignment. Since there
are extensive gene annotations in GO (Gene Ontology
database, http://www.geneontology.org/)
[Ashburner and Lewis, 2002], WebArrayDB is also
designed to facilitate the use of GO for probe
searching. The GO database in WebArrayDB is
updated monthly.

The project table is linked to the “users” table
that contains the user information including user
name and password (Figure 1), enabling data access
to be controlled based on user privileges. Every
project has an associated release date which determines
the public accessibility of the project. By
default the project release date will be two years
from the data deposit date to protect data privacy.
The user can change the release date at the time
the data is deposited or at any time thereafter.

WebArrayDB is powered by the affy [Gautier
et al., 2004] and the Linear Models for Microarray
Data (LIMMA, http://bioinf.wehi.edu.au/limma) [Smyth, 2005]
packages from Bioconductor (http://www.bioconductor.org/),
an open source and open development software
project for the analysis and comprehension
of genomic data. Thus, many different formats of
intensity files are recognized, including data from
Affymetrix CEL files, Agilent Feature Extraction,
ArrayVision, BlueFuse, GenePix, QuantArray (Version
3 or later), SMD and SPOT. Any formats that
affy and LIMMA do not recognize can be accepted
when defined by the user in a tab-delimited text file,
including data with more than 2 scanned channels.

WebArrayDB stores parsed data in database tables.
The image files, intensity files, probe files, protocol
files and other user-supplied raw data files are
stored in the file system on servers with indices in
the database.

Data Analysis

Data queried from the database can be directly subjected
to analysis. WebArrayDB presents a variety
of options for data preprocessing and differential
analysis. Conservative default analysis methods
and parameters are set so that novice users will be
less likely to use flawed analysis strategies.

Data preprocessing

Data preprocessing includes cross-platform probe
alignment, background correction and normalization.
For cross-platform analysis, the primary concern
is how to match probes from different platforms.
Based on the intrinsic relationships between
platforms, we offer three approaches to this issue.

• Direct match

Direct match is used when all probes are identical
across microarray platforms.

• Match by reference IDs

Probes from two different platforms can be
aligned if they share the same reference ID.
IDs from well-known public databases, for example,
UniGene ID or Ensembl ID, can serve
as reference IDs, as can any user-defined category.

• Match by file

Users can align probes by providing a probe-mapping
file, in which homologous probes are
explicitly mapped.
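
The "match by reference IDs" approach can be illustrated with a minimal Python sketch. It is not WebArrayDB's own code: the function name, the dictionary layout and the example IDs are assumptions made for illustration, and replicate probes are collapsed by their median, which is one of several possible choices.

```python
from statistics import median


def align_by_reference_id(platform_a, platform_b):
    """Align two platforms on shared reference IDs (hypothetical helper).

    Each platform maps probe ID -> (reference ID, expression value).
    Replicate probes sharing a reference ID are collapsed by their median.
    Returns {reference ID: (value_a, value_b)} for IDs found on both platforms.
    """
    def collapse(platform):
        grouped = {}
        for ref_id, value in platform.values():
            if ref_id:  # probes without a reference ID cannot be aligned
                grouped.setdefault(ref_id, []).append(value)
        return {ref_id: median(values) for ref_id, values in grouped.items()}

    a, b = collapse(platform_a), collapse(platform_b)
    return {ref_id: (a[ref_id], b[ref_id]) for ref_id in sorted(a.keys() & b.keys())}


# Illustrative data only: the IDs and values are made up.
cdna = {"c1": ("Hs.1", 2.0), "c2": ("Hs.1", 4.0), "c3": ("Hs.2", 1.0), "c4": (None, 9.0)}
affy = {"p1": ("Hs.1", 7.5), "p2": ("Hs.3", 0.5)}
aligned = align_by_reference_id(cdna, affy)
```

Only reference IDs present on both platforms survive the alignment, so the size of the aligned set depends on how completely each platform is annotated.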

If multiple platforms are involved, normalization
within or between arrays of the same platform can
be done directly on the raw data before probe alignment.
After alignment, the whole data set can be
normalized.

Differential  analysis 

Users can analyze data based on either ratio or intensity.
The ratio-based model is $R = \mu + \varepsilon$, where
$R$ is the ratio, $\mu$ represents the intercept of the ratio
of the two groups and $\varepsilon$ represents the Gaussian
random error. We say two samples are different if
$\mu$ differs significantly from its value under the null hypothesis.
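
As a hedged illustration of testing the intercept of the ratio-based model, the sketch below computes a one-sample t statistic for a single probe. It assumes the ratios have been log-transformed so that the null value of the intercept is 0; the function name and the `mu0` parameter are hypothetical, and converting the statistic to a p value (t distribution with n - 1 degrees of freedom) is left to a statistics library.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(log_ratios, mu0=0.0):
    """t statistic for H0: intercept == mu0, from replicate log ratios of one probe."""
    n = len(log_ratios)
    if n < 2:
        raise ValueError("at least two replicates are needed to estimate the error")
    standard_error = stdev(log_ratios) / sqrt(n)
    return (mean(log_ratios) - mu0) / standard_error


# Made-up log ratios for one probe across three arrays.
t_stat = one_sample_t([1.0, 2.0, 3.0])
```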

More than one comparison among groups of
data can be requested simultaneously. Furthermore,
users may apply “+”, “-” and parentheses
to make more specific comparisons. For instance,
given four groups, “(group1 + group2) - (group3 +
group4)” computes the global difference between
array data supplied in the first two groups compared
to array data supplied in the last two groups.
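
A comparison of this kind can be read as a linear contrast of group means. The sketch below is only an illustration under that assumption: how WebArrayDB parses and evaluates the expression internally is not described here, and the function name and the coefficient dictionary are hypothetical.

```python
from statistics import mean


def contrast(groups, coefficients):
    """Weighted sum of group means; coefficients of +1/-1 encode "+" and "-"."""
    return sum(coef * mean(groups[name]) for name, coef in coefficients.items())


# Made-up expression values of one probe in four groups.
groups = {
    "group1": [1.0, 3.0],
    "group2": [2.0, 4.0],
    "group3": [0.0, 2.0],
    "group4": [1.0, 1.0],
}
# (group1 + group2) - (group3 + group4)
coefficients = {"group1": 1, "group2": 1, "group3": -1, "group4": -1}
difference = contrast(groups, coefficients)
```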

Fold-change analysis, Student’s t-test, eBayes-moderated
t-test [Smyth, 2004, Smyth et al., 2005],
SAM test [Tusher et al., 2001], non-parametric tests
(including Wilcoxon rank sum test, Kruskal-Wallis
rank sum test and Friedman rank sum test) and
ANOVA/ANCOVA are among the choices of algorithms
for differential analysis of data in WebArrayDB.

Mixed-effect model ANOVA plays a very important
role in microarray data analysis [Churchill,
2002]. ANOVA is capable of dealing with multiple
factors. The default model in WebArrayDB is

$E = \mu + G + P + A + D + S + I + \varepsilon$  (1)

where $E$ is the observed log-transformed intensity
value, $\mu$ is the theoretical “real” log-transformed
intensity value, $\varepsilon$ represents the Gaussian random
error with 0 as expected value, and $G$ is the group
factor, which leads to effects of interest, e.g. treatment
effects. $P$, $A$, $D$, $S$ and $I$ represent effects of
platform, array, dye, sample and individual respectively,
among which array and individual are considered
random effect factors. Based on the data to
be analyzed, more or fewer factors might be used in
specific analysis processes.

Experienced  users  can  define  new  factors  and 
create  complicated  analysis  models.  This  enables 
WebArrayDB  to  analyze  data  from  virtually  any 
experimental  design  and  thereby  to  retain  relevance 
as  methods  continue  to  evolve. 

Other  analysis  tools 

Both raw and differentially analyzed data can be
used for further analysis, including hierarchical
clustering, correspondence analysis, between group
analysis, and plotting using genome position. A variety
of high-quality charts in PDF and EPS formats
can be produced to visualize analysis results.

Example 

Data  sources 

A demonstration of a cross-platform analysis is
used as a training example in every WebArray
account. This example uses two publicly available
prostate cancer microarray data sets. One
set was obtained using a custom made cDNA microarray
(20K chip, platform MPMDB ID:42) that
contains 19,947 sequence verified PCR-amplified
human cDNAs representing 15,495 UniGene clusters
[Dhanasekaran et al., 2005] (project MPMDB
ID:76). The other was obtained using a
commercially available oligonucleotide microarray
(Affymetrix U95A array, platform MPMDB ID:9)
that contains 12,626 probe sets consisting of 25-base
oligonucleotide probes [Welsh et al., 2001] (project
MPMDB ID:78). From the two data sets, 49 tumor
samples (prostate cancer) and 21 non-tumor
samples are analyzed in this example.

Options  for  analysis 

Analysis options selected for this demonstration are
illustrated in Figure 2. The IDs from the UniGene
database (http://www.ncbi.nlm.nih.gov/UniGene)
are used to match cDNA clones and Affymetrix
probe sets between platforms. Within each study,
the median value is used for expression values corresponding
to probes of the same UniGene cluster.
Genes not mapping to a UniGene cluster present
in both microarray platforms are not considered for
cross-platform analysis. For integration and normalization
of microarray measurements from different
platforms, we apply quantile discretization
[Warnat et al., 2005]. A common reference sample
is used in the two color cDNA microarray study
and the log2 ratios of the intensity values from
experimental samples over the common reference
sample are calculated for each individual array and
used for further analysis. A non-parametric analysis
method, the Wilcoxon rank sum test, is used for
differential analysis.
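
The idea behind quantile discretization can be sketched as follows: within each array, values are ranked and replaced by the index of an equal-sized rank bin, so that arrays from different platforms end up on a common discrete scale. This is a simplified illustration, not the exact procedure of Warnat et al. or of WebArrayDB; the function name and the tie handling (ties keep input order) are assumptions.

```python
def quantile_discretize(values, n_bins=8):
    """Replace each value of one array by the index (0..n_bins-1) of its rank bin."""
    n = len(values)
    # Indices of the values in increasing order; sorted() is stable, so ties keep input order.
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // n
    return bins


# Made-up expression values for one array, discretized into 4 bins for brevity.
sample = [5.2, 0.1, 3.3, 9.8, 7.4, 1.0, 2.2, 6.1]
binned = quantile_discretize(sample, n_bins=4)
```

Because only ranks within an array are used, the result is insensitive to platform-specific scale differences, at the cost of discarding the magnitude of expression differences inside a bin.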

Results 

A total of 4690 probes are identified as common
to both datasets, among which 661 are reported to
be differentially expressed between tumor and non-tumor
samples at p < 0.01, with 267 retained after
false discovery rate adjustment by the step-up
method of Benjamini-Hochberg (1995). Hierarchical
clustering is performed for the top 30 most significant
differentially expressed gene sets (Figure 3).
Clustering results show that the samples were separated
into two major groups correlating with their
biological origin (tumor vs non-tumor) instead of
their platforms. In general, discriminative gene sets
found in two data sets on different platforms are
likely to be more reliably characteristic of tumor
status than the genes obtained from each individual
data set [Warnat et al., 2005].
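
For readers unfamiliar with the Benjamini-Hochberg step-up adjustment mentioned above, a minimal Python sketch follows. It implements the standard procedure on a plain list of p values; the function name and the example values are illustrative and are not taken from the analysis reported here.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity of the adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p values for four probes.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```

A probe is retained at a given false discovery rate when its adjusted p value is at or below that rate.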

Implementation 

WebArrayDB has been implemented on a LAMP
system (a Linux server with Apache, MySQL
and Python) in a typical browser/server model
(Figure 4). In a deployment, the WebArrayDB
web server, database server and file server can
be located on a single machine or on separate
machines. Most modules are written in Python
(http://www.python.org), while analysis functions
are powered by the R language (http://www.r-project.org)
[R Development Core Team, 2006] and
Bioconductor [Gentleman et al., 2004]. Our WebArrayDB
is hosted on a Dell server with 4 CPU
cores with hyper-threading technology, 24GB of
RAM, a 1 TB main hard disk and a 1 TB hard disk
for backup. The configuration will be upgraded
depending on the computational load and increases
in the data stored.

Parallel  computation  can  be  done  at  two  levels: 

• Multiple analysis requests from users can be
processed simultaneously. In order to avoid
too many active requests, WebArrayDB will
automatically determine a maximum number
of requests that can be processed simultaneously,
limiting both the number per user
and the total number, while keeping other requests
waiting in the queue. The default values
can be adjusted by the administrator.

• Even in a single analysis request, computation
can be distributed into many processes that
run in parallel. The number of processes can
be adjusted by the administrator. The package
SNOW [Rossini et al., 2003] was adopted
for this purpose, so Message Passing Interface
(MPI), Parallel Virtual Machine (PVM)
or SOCKET can be used for communication
in parallel computation.
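
The first level of parallelism, a request queue with a per-user cap and a total cap, can be sketched in Python as below. This is an illustration of the queuing rule only, not WebArrayDB's scheduler: the function name, the data layout and the cap values are assumptions (in WebArrayDB the limits are determined automatically and can be adjusted by the administrator).

```python
from collections import Counter, deque


def start_runnable(queue, running, max_total=4, max_per_user=2):
    """Move requests from the waiting queue into `running` while both caps allow.

    queue:   deque of (user, request_id) in arrival order
    running: list of (user, request_id) currently being processed
    Returns the requests started in this pass; the rest stay queued in order.
    """
    per_user = Counter(user for user, _ in running)
    started = []
    still_waiting = deque()
    while queue:
        user, request_id = queue.popleft()
        if len(running) < max_total and per_user[user] < max_per_user:
            running.append((user, request_id))
            per_user[user] += 1
            started.append((user, request_id))
        else:
            still_waiting.append((user, request_id))
    queue.extend(still_waiting)
    return started


# Made-up requests: one user submits three jobs, three others submit one each.
queue = deque([("ann", 1), ("ann", 2), ("ann", 3), ("bob", 4), ("cy", 5), ("dee", 6)])
running = []
started = start_runnable(queue, running, max_total=4, max_per_user=2)
```

Calling the function again after a running request finishes (and is removed from `running`) lets the next eligible waiting request start.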

Although WebArrayDB is presented as a web
server on the internet, a package is downloadable
for those who want to build their own dedicated
servers with Win32 or POSIX (Portable Operating
System Interface) on SMP systems or Linux clusters.
WebArrayDB is designed as a lightweight
database with a user friendly web interface facilitating
ease of use for bench scientists. Although
a curator is always desirable, there is no necessity
for one. WebArrayDB is an ideal tool for individual
researchers, laboratories, or small research institutes
to store, share and analyze microarray
data. Installation and maintenance of the WebArrayDB
server are likely to require only a few hours
of assistance from IT staff.

Tutorial  and  examples 

A web-based tutorial, presented in English, Chinese,
and Spanish at the WebArrayDB website
(http://www.webarraydb.org), shows how to upload
data and how to process a simple example.
The input data and analysis results used in the tutorial
(simple analysis) and this manuscript (complex
cross-platform comparison) are available for viewing
by all WebArrayDB users. Analysis methods
other than the preselected ones can be chosen for
these examples, and results of these changes can
be viewed and stored in the user-specific accounts.
Thus, all new users have the opportunity to familiarize
themselves with the powerful capabilities
of WebArrayDB by browsing and editing both the
simple and the complex examples in the “demo”
account upon first entry into the system.
Figure 1: Information organization in WebArrayDB.

[Figure 2 image: screenshot of the analysis options form, with the following settings.
a) Data preprocessing: probes matched across platforms by user-specified columns (UniGene ID for both the Human Genome U95A Array and the 20K array); replicate probes combined by median; background correction and within-platform normalizations set to none; cross-platform normalization with 8 bins; data saved in files.
b) Differential analysis: non-parametric test; data not paired/blocked; comparison group2-group1.
c) Other analysis tools: filter keeping the first 30 probes of smallest p value; clustering and heatmap output.]

Figure  2:  Options  selected  in  an  analysis  of  two  publicly  available  prostate  cancer  microarray  data  sets. 
See  text  for  details. 


Figure 3: Heat map of the 30 most significantly differentially expressed probes between tumor and
non-tumor samples.

The  tumor  samples  are  marked  at  the  top  of  the  plot  by  a  brown  bar  and  the  non-tumor  group  by  a 
yellow  bar.  Arrays  of  the  20K  platform  are  named  in  blue  font  at  the  bottom  of  the  plot,  Affymetrix 
U95A  arrays  in  black  font. 
Figure 4: Architecture of WebArrayDB

Abstract

Most current microarray oligonucleotide probe design strategies are based on probe design factors
(PDFs), which include probe hybridization free energy (PHFE), probe minimum folding energy
(PMFE), dimer score, hairpin score, homology score, and complexity score. The impact of these PDFs
on probe performance was evaluated using four sets of microarray comparative genome hybridization
(aCGH) data, which included two array manufacturing methods and the genomes of two species.
Since most of the hybridizing DNA is equimolar in CGH data, such data are ideal for testing the
general hybridization properties of almost all candidate oligonucleotides. In all our datasets, PDFs
related to probe secondary structure (PMFE, hairpin score and dimer score) are the most significant
factors linearly correlated with probe hybridization intensities. PHFE, homology and complexity
score correlate significantly with probe specificities, but in a non-linear fashion. We developed a
new probe design factor, pseudo probe binding energy (PPBE), by iteratively fitting di-nucleotide
positional weights and di-nucleotide stacking energies until the average residual sum of squares (ARSS)
for the model was minimized. PPBE showed a better correlation with probe sensitivity and a better
specificity than all other PDFs, although training data are required to construct the PPBE model
prior to designing new oligonucleotide probes. The physical properties that are measured by PPBE
are as yet unknown but include a platform-dependent component. A practical way to use these PDFs
for probe design is to set cut-off thresholds to filter out poor-quality probes. Programs and correlation
parameters from this study are freely available to facilitate the design of DNA microarray
oligonucleotide probes.

Key  words:  microarray,  probe  design,  oligonucleotide 

Introduction 

Microarray technology surveys many thousands of genes to investigate gene expression [1],
transcription factor binding profiles [2-5], DNA methylation profiles [4,6], comparisons of DNA copy
number [5] and comparative genomic sequencing [7].

Oligonucleotide probes provide higher hybridization specificity than longer PCR products [8-10].
Falling costs of oligonucleotide synthesis, along with the development of new microarray manufacture
technologies, such as the NimbleGen maskless array synthesizer [11] and Agilent’s ink-jet
oligonucleotide synthesizer, make custom long (50 bases) oligonucleotide arrays possible for
many experimental applications. Optimal probe design algorithms are consequently desirable.

Hybridization on an array can be explained by several interconnected processes, including the
affinity of a target for a probe, the formation of stem-loop structures of a probe, the formation
of secondary structures (loops and helices) of a target, and probe-to-probe dimerization [12-16].
There are a variety of factors governing these processes, including probe hybridization free energy
(PHFE) [17], probe minimum folding energy (PMFE) [18], probe dimer and hairpin scores [19],
as well as homology and complexity scores [20]. Most of the current oligonucleotide probe design
software packages estimate these properties [20-28].

To systematically and quantitatively study how these factors influence probe performance in
microarrays, we collected a large amount of array CGH microarray data and used these data to
evaluate the utility of each probe design factor (PDF) for probe selection. Using aCGH data, a novel
probe design factor, pseudo probe binding energy (PPBE), was developed. PPBE is more accurate
in predicting probe performance than all other factors and can thus be used for iterative improvement
of the choice of oligonucleotides on the array. While the specific physical properties measured
by PPBE remain unknown, they encompass platform-specific parameters.

Methods 

Microarray  CGH  Data  Sets 

Four  comparative  genome  hybridization  microarray  data  sets  were  used  in  the  study  (Table  1). 
Human  genomic  DNA  (data  sets  1,  2  and  4)  and  Salmonella  genomic  DNA  (data  set  3)  samples 
were  hybridized  to  their  corresponding  arrays.  The  array  platforms  include  NimbleGen  arrays  (3’ 
end  of  oligos  is  linked  to  the  solid  phase)  and  in-house  spotted  oligonucleotide  arrays  (5’  end  of 
oligos is linked to the solid phase). The majority of probes on the arrays we use are 50 nucleotides
in length. However, there are also probes of different lengths; for example, the array for data set 4
carries 9989 46-mer probes and 4721 55-mer probes. We found that the correlations of PDFs to
probe  sensitivities  for  these  probes  are  very  similar  to  those  of  the  50-mer  probes  (data  not  shown). 
In  order  to  make  data  comparable  across  platforms,  only  data  from  50-mer  oligonucleotide  probes 
were  used.  Hybridization  intensity  values  were  natural  log  transformed  before  fitting  the  linear 
models. 

Samples that were hybridized to the arrays included human and Salmonella genomic DNA. Data
set 3 used pooled Salmonella genomic DNA XbaI restriction fragments, representing half of the
genome in three-fold excess, in one channel, and whole genomic DNA in the other. Data set 4
contains 205 replicates of human lung tissue genomic DNA hybridizations which were used as the
control channel in two-color hybridization experiments.

Probe  Design  Factors 

The  following  DNA  microarray  probe  design  factors  were  included  in  this  study. 

Probe  hybridization  free  energy  (PHFE) 

PHFE  was  calculated  based  on  the  di-nucleotide  stacking  energies. 

$$\mathrm{PHFE} = \varepsilon_{head} + \sum_{k=1}^{n-1} \varepsilon(b_k, b_{k+1}) + \varepsilon_{tail}$$

where $n$ is the oligonucleotide length, $\varepsilon(b_k, b_{k+1})$ is the kth position di-nucleotide
stacking energy, and $\varepsilon_{head}$ and $\varepsilon_{tail}$ are the terminal nucleotide stacking
energies. The salt concentrations for the calculations were set to 1 M Na+ and 0 M Mg++, and the
temperature was set to 40, 50 or 60°C for the computation of PHFE. The di-nucleotide stacking
energies were computed according to SantaLucia [17] and are shown in Supplementary Table 1.
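
The PHFE sum above can be sketched in a few lines of Python. This is only an illustration: the function name `phfe` and the table `TOY_STACKING` are hypothetical, and the energies in the table are placeholders, not the SantaLucia values from Supplementary Table 1.

```python
def phfe(seq, stacking, e_head=0.0, e_tail=0.0):
    """Sum di-nucleotide stacking energies along a probe, plus terminal terms."""
    seq = seq.upper()
    total = e_head + e_tail
    for k in range(len(seq) - 1):
        total += stacking[seq[k:k + 2]]  # epsilon(b_k, b_{k+1})
    return total


# Placeholder energies (NOT thermodynamic values): -1.0 for every
# di-nucleotide except CG and GC, which are set to -2.0.
TOY_STACKING = {a + b: -1.0 for a in "ACGT" for b in "ACGT"}
TOY_STACKING["CG"] = TOY_STACKING["GC"] = -2.0

print(phfe("ACGCGT", TOY_STACKING))
```

In practice the stacking table would be replaced by temperature- and salt-specific values such as those in Supplementary Table 1.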

Pseudo  Probe  Binding  Energy  (PPBE) 

For a probe sequence $(b_1, b_2, \ldots, b_n)$ with $n$ bases, the PPBE model is parameterized by
di-nucleotide stacking energies $\varepsilon$ and position-dependent weights $\omega$:

$$\mathrm{PPBE} = \varepsilon_{head} + \sum_{k=1}^{n-1} \omega_k \, \varepsilon(b_k, b_{k+1}) + \varepsilon_{tail}$$

The position-dependent weights $\omega$ are first estimated by fitting the linear model, employing
the di-nucleotide stacking energies (as used in the PHFE model) as initial values. Then, with the
same linear model fitting scheme, the pseudo di-nucleotide stacking energies $\varepsilon$ are
approximated by treating the previously estimated weights as known quantities. This process of
“reciprocal” estimation was carried out iteratively three times, at which point the ARSS for the
PPBE model reached its minimum or near-minimum (see also the Linear Modeling section below, and
Figure 1A).
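
The reason the reciprocal estimation can reuse one linear fitting scheme is that PPBE is linear in the weights when the energies are fixed, and linear in the energies when the weights are fixed. The sketch below illustrates this with hypothetical helper names (`ppbe`, `weight_design_row`, `energy_design_row`) and placeholder energies; it shows how the regressors for each step could be built, not the authors' actual fitting code.

```python
def ppbe(seq, stacking, weights, e_head=0.0, e_tail=0.0):
    """Position-weighted sum of di-nucleotide stacking energies."""
    seq = seq.upper()
    if len(weights) != len(seq) - 1:
        raise ValueError("need one weight per di-nucleotide position")
    total = e_head + e_tail
    for k, w in enumerate(weights):
        total += w * stacking[seq[k:k + 2]]
    return total


def weight_design_row(seq, stacking):
    """Regressors for the weight step: the energy at each position."""
    seq = seq.upper()
    return [stacking[seq[k:k + 2]] for k in range(len(seq) - 1)]


def energy_design_row(seq, weights):
    """Regressors for the energy step: summed weight per di-nucleotide type."""
    seq = seq.upper()
    row = {}
    for k, w in enumerate(weights):
        pair = seq[k:k + 2]
        row[pair] = row.get(pair, 0.0) + w
    return row


# Placeholder energies (NOT thermodynamic values), for illustration only.
TOY_STACKING = {a + b: -1.0 for a in "ACGT" for b in "ACGT"}
TOY_STACKING["CG"] = TOY_STACKING["GC"] = -2.0
```

With all weights equal to 1, `ppbe` reduces to the PHFE sum, which is why the conventional stacking energies are a natural starting point.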


Probe  minimum  folding  energy  (PMFE) 


PMFE is the minimum folding energy of a single-stranded DNA sequence and represents the
stability of the secondary structure of a given sequence. PMFE values were computed using the
MFOLD program [18]. The program hybrid-ss-min was downloaded from
http://www.bioinfo.rpi.edu/applications/hybrid/download.php

and executed on GNU/Linux. The parameters were set to DNA-DNA hybridization, 1 M Na+ and 0 M
Mg++, and the temperature was set to 40, 50 or 60°C for the calculation of PMFE.


Probe  dimer  score,  hairpin  score 


The calculation of the probe dimer score and the hairpin score was described as part of the
AutoDimer program, based on a sliding algorithm [19]. For screening probe dimers, two probe
sequences are incrementally overlapped, and the presence or absence of base pairing is evaluated
and tabulated. A dimer score value was then determined by combining the number of
Watson-Crick base pairs (+1) with mismatches (-1).
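
A minimal sketch of such a sliding comparison is given below. The function name `dimer_score` is hypothetical, and two details are assumptions made for illustration rather than statements about AutoDimer: the second probe is read in antiparallel orientation, and the reported score is the maximum over all overlaps.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def dimer_score(probe_a, probe_b):
    """Best (+1 per Watson-Crick pair, -1 per mismatch) score over all overlaps."""
    a = probe_a.upper()
    b_rev = probe_b.upper()[::-1]  # antiparallel orientation (assumption)
    best = None
    # Slide b_rev along a; every offset in this range overlaps by >= 1 base.
    for offset in range(-(len(b_rev) - 1), len(a)):
        score = 0
        for i, base in enumerate(a):
            j = i - offset
            if 0 <= j < len(b_rev):
                score += 1 if COMPLEMENT[base] == b_rev[j] else -1
        if best is None or score > best:
            best = score
    return best
```

Calling `dimer_score(p, p)` with the same probe twice gives a self-dimer score under the same assumptions.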


Hairpin secondary structures were screened by using the probe sequence to check for the presence
of 4- and 5-base loops. A minimum of a 2-base stem was deemed necessary in a hairpin
structure. Hairpin scores were sums of matched base pairs (+1) in hairpin stems, where mismatches
are not permitted.
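
The hairpin screen can be sketched as follows. The function name `hairpin_score` and its parameters are hypothetical; reporting the longest perfect stem found over all loop positions is an assumption made for illustration.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def hairpin_score(probe, loop_sizes=(4, 5), min_stem=2):
    """Longest mismatch-free stem (in base pairs) around a 4- or 5-base loop."""
    seq = probe.upper()
    best = 0
    for loop in loop_sizes:
        # 'start' is the first loop base; the stem grows outward from the loop.
        for start in range(1, len(seq) - loop):
            left, right = start - 1, start + loop
            stem = 0
            while (left >= 0 and right < len(seq)
                   and COMPLEMENT[seq[left]] == seq[right]):
                stem += 1
                left -= 1
                right += 1
            if stem >= min_stem:  # shorter stems are not counted as hairpins
                best = max(best, stem)
    return best
```

For example, in `GGGAAAACCC` the stem `GGG`/`CCC` closes the 4-base loop `AAAA`, giving three matched pairs.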

Homology Score


The  homology  score  for  each  oligonucleotide  estimates  the  degree  of  cross  hybridization,  and  is 
based  on  a  BLAST  search  of  the  input  sequence  against  a  species-specific  database.  The  calculation 
of  the  homology  score  was  similar  to  the  one  used  in  the  OligoWiz  program  [20]. 


$$\text{Homology Score} = \frac{100 \times L - \sum_{i=1}^{L} \max(B_{1i}, \ldots, B_{mi})}{100 \times L}$$

where $L$ is the length of the oligonucleotide, $m$ is the number of BLAST hits considered at
position $i$ of the oligonucleotide, and $B = \{B_{1i}, \ldots, B_{mi}\}$ are the bit scores at
position $i$.

Oligonucleotides with 100% identity to any considered BLAST hit along their full length get a
score of 0. A score value is assigned to oligonucleotides that have no perfect homology to any
considered BLAST hit. BLAST hits with an identity lower than 70% or shorter than 15 bp were
removed, resulting in perfect homology scores of 1.
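
The homology score formula can be illustrated with a short sketch. The function name `homology_score` and the hit representation are hypothetical. As a loudly stated assumption, each hit contributes its percent identity (0 to 100) as the per-position value $B$, which is what makes the $100 \times L$ normalization range from 0 to 1; the text above calls this value a bit score.

```python
def homology_score(oligo_len, hits, min_identity=70.0, min_length=15):
    """Homology score from BLAST-like hits.

    hits: iterable of (start, end, identity) with 0-based start, exclusive end
    and identity in percent. The per-position value is taken to be the hit's
    percent identity (an assumption for this sketch).
    """
    best = [0.0] * oligo_len  # max over hits at each oligo position
    for start, end, identity in hits:
        if identity < min_identity or (end - start) < min_length:
            continue  # filtered hits do not count against the oligo
        for i in range(max(start, 0), min(end, oligo_len)):
            best[i] = max(best[i], identity)
    return (100.0 * oligo_len - sum(best)) / (100.0 * oligo_len)
```

An oligonucleotide with no retained hits scores 1, and one covered end to end by a 100% identity hit scores 0, matching the two limiting cases described above.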


Complexity  Score 


Complexity  scores  were  calculated  for  estimating  the  degree  of  common  sequence  fragments  in  a 
given  oligonucleotide,  as  described  in  the  OligoWiz  program  [20].  The  information  content  can  be 
calculated  by  the  following  equation: 


$$I(w) = \frac{n(w)}{n_t} \log_2 \left( \frac{n(w) \times 4^{l(w)}}{n_t} \right)$$


where $n(w)$ is the number of occurrences of a pattern in the genome, $l(w)$ is the pattern length,
and $n_t$ is the total number of patterns found in the DNA sequences present in the target pool, for
example, the whole genome in an array comparative genomic hybridization. The following equation
was used to calculate the complexity score for each oligonucleotide probe:

$$\text{Complexity Score} = 1 - \mathrm{norm}\left( \sum_{i=1}^{L - l(w) + 1} I(w_i) \right)$$

where $L$ is the length of the oligonucleotide, $w_i$ is the pattern at position $i$ and norm is a
function that normalizes the summed information to a value between 1 and 0 by dividing it by the
maximum value. A complexity score of 0 indicates an oligonucleotide with very low complexity.
Pattern lengths of 2, 5, 8 and 11 bases were tested in this study.
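
The two complexity equations can be sketched as below. All function names are hypothetical, the pattern counts are passed in as a plain dictionary, and two choices are assumptions made for illustration: a pattern that never occurs contributes zero information, and the normalization divides by the largest summed information in the probe set.

```python
import math


def information_content(n_w, l_w, n_t):
    """I(w) for a pattern seen n_w times among n_t patterns of length l_w."""
    if n_w == 0:
        return 0.0  # unseen pattern: treated as contributing nothing
    return (n_w / n_t) * math.log2(n_w * 4 ** l_w / n_t)


def summed_information(oligo, counts, l_w, n_t):
    """Sum I(w_i) over the L - l(w) + 1 windows of an oligonucleotide."""
    return sum(
        information_content(counts.get(oligo[i:i + l_w], 0), l_w, n_t)
        for i in range(len(oligo) - l_w + 1)
    )


def complexity_scores(summed):
    """1 - norm(sum), normalizing by the maximum over the probe set."""
    top = max(summed)
    if top <= 0:
        return [1.0 for _ in summed]  # nothing to normalize by
    return [1.0 - s / top for s in summed]
```

The probe with the largest summed information receives a score of 0, consistent with the statement that 0 marks very low complexity.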

Oligonucleotide  Specificity  and  Reproducibility 

Data set 3, with known expected oligonucleotide signal ratios (three-fold changes) between the two
channels, was used for estimating oligonucleotide probe specificity. The observed ratios were log
base 2 transformed for further analysis. The coefficient of variation (cv) was used for estimating
probe reproducibility.
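
Both quantities are simple to compute; a minimal sketch follows. The function names are hypothetical, and the use of the sample standard deviation for the cv is an assumption, since the text does not say which estimator was used.

```python
import math
import statistics


def log2_ratio(channel_a, channel_b):
    """Log base 2 of the intensity ratio between the two channels."""
    return math.log2(channel_a / channel_b)


def coefficient_of_variation(intensities):
    """Standard deviation divided by the mean of replicate intensities."""
    return statistics.stdev(intensities) / statistics.mean(intensities)
```

Under this transformation the expected three-fold ratio corresponds to a log2 value of about 1.585.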

Linear  Modeling  and  Model  Validation 

The R language (http://www.r-project.org) was used for linear modeling [29-31]. In the four
microarray data sets, simple linear models were used to evaluate each individual probe design
factor and multi-variate models were used to estimate all probe design factors together.

The Average Residue Sum of Squares (ARSS), which reflects the model fitness, was defined as

$$\mathrm{ARSS} = \frac{\sum_{i=1}^{n} (g_i - g_i^{*})^2}{n}$$

where $g_i$ was the observed ln-transformed intensity for probe $i$, $g_i^{*}$ was the predicted
ln-transformed intensity for probe $i$, and $n$ was the number of probes. For model selection, the
stepAIC function in the MASS package (http://www.r-project.org) was used to reduce the full
model to the optimal one. The Akaike information criterion (AIC) is a measure of the quality of
fit of an estimated statistical model and balances the complexity of an estimated model with the
accuracy with which the model fits the data [32].

The models were validated in two ways: within one data set and across different data sets. In both
cases, leave-many-out cross-validation [33] was used. Within-dataset validation uses half of
the data from one data set to train the models and the other half to test them. Cross-dataset
validation uses different data sets, which may vary in array platform and sample species,
for training and testing.
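
The ARSS definition and the within-dataset (half/half) validation can be sketched in plain Python for a single-predictor model. All names here are hypothetical. The AIC helper uses the Gaussian least-squares form, n ln(RSS/n) + 2k, which is defined only up to an additive constant; that it matches what stepAIC computed in the original analysis is an assumption.

```python
import math
import random


def arss(observed, predicted):
    """Average residue sum of squares between observed and predicted values."""
    return sum((g - p) ** 2 for g, p in zip(observed, predicted)) / len(observed)


def aic_from_rss(rss, n, n_params):
    """Gaussian least-squares AIC up to an additive constant."""
    return n * math.log(rss / n) + 2 * n_params


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def split_half_validation(x, y, seed=0):
    """Train on a random half, return (training ARSS, test ARSS)."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    half = len(idx) // 2
    train, test = idx[:half], idx[half:]
    intercept, slope = fit_line([x[i] for i in train], [y[i] for i in train])

    def score(rows):
        return arss([y[i] for i in rows], [intercept + slope * x[i] for i in rows])

    return score(train), score(test)
```

A test ARSS close to the training ARSS indicates that the fitted model generalizes within the data set; cross-dataset validation would instead fit on one data set and score on another.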

Results 

Microarray  CGH  Data  Sets 

Array CGH data is a valuable source for studying microarray oligonucleotide probe performance
because it can be assumed that most of the probes in these experiments hybridize to approximately
equimolar target amounts, resulting in relatively uniform hybridization signals. Four large aCGH
data sets on different array platforms, with a total of 657,646 50-mer oligos and 219 samples,
were used in this study to evaluate probe design factors and to develop new algorithms (see
Table 1).

Correlation  of  Individual  Probe  Design  Factors  (PDFs)  with  Probe  Hybridization  Intensities 

The models examined are all presented in the methods section and will not be repeated here. All
ten probe design factors (PDFs), i.e., probe hybridization free energy (PHFE), probe minimum
folding energy (PMFE), hairpin score, probe dimer score, homology score, complexity score (2
bases), complexity score (5 bases), complexity score (8 bases), complexity score (11 bases), and
pseudo probe binding energy (PPBE), showed highly significant correlation with probe
hybridization intensities, as shown in Figure 2 (data set 1) and Supplementary Figure 1 (data
sets 2, 3 and 4). The correlation coefficients (r), ARSS, intercepts and slopes for these linear
regression models are listed in Table 2 and Supplementary Table 2.

The average residue sum of squares (ARSS) values of linear models based on individual PDFs were
compared, as shown in Figure 3. Among these factors, PPBE generated the lowest ARSS,
suggesting that this factor is superior to the traditional factors in correlating with probe
hybridization intensity. PPBE was modeled by iteratively fitting di-nucleotide stacking energies
and positional weights, with the conventional di-nucleotide stacking energies as initial values.
The ARSS values from the PPBE model tend to stabilize after three cycles of iterative fitting of
the positional weights and the pseudo di-nucleotide stacking energies (Figure 1 and Supplementary
Figure 2). The positional weights and pseudo di-nucleotide stacking energies generated from the
different data sets are entirely different, reflecting the empirical nature of the model. The
positional weights and pseudo stacking energies for PPBE models from different data sets are
listed in Supplementary Tables 3 and 4. The positional weights illustrate the effect of the
distance of the dinucleotide from the solid phase. The positional weights of data set 2 and data
set 4, for example, showed inverse correlations with the distance to the probe’s 5’ end, which may
be due to the fact that these platforms differed in the end of the oligos that was linked to the
solid phase (5’ versus 3’).

The best individual traditional factors are PMFE, dimer score and hairpin score in most data sets.
All three of these PDFs showed that less stable probe secondary structure positively correlates
with probe hybridization intensity, suggesting that the formation of secondary structure can
severely hinder the probe hybridization capabilities.

PHFE’s linear correlation with probe hybridization intensity was less significant, suggesting that
hybridization behavior on microarrays might be different from that in solution. Moreover, quadratic
rather than linear relationships were observed for data sets 1 and 3, and the mode (the peak points
shown in Figure 2A and Supplementary Figure 1-2A) varies between these two data sets,
suggesting that hybridization conditions were not the same for the two data sets. We tried to use
quadratic equations to fit data sets 1 and 3, but the ARSS values generated from these models were
larger than those obtained using simple linear models (data not shown). This is probably due to the
fact that the majority of PHFE data points are clustered within a very narrow range, where the
relationship between PHFE and intensities may be better described by a linear equation. In future
studies, once there are sufficiently large data sets with PHFE values spread across a wider range,
more advanced models can be applied to scrutinize the relationship between PHFE and
hybridization intensities in a non-linear fashion.

Blast score and complexity scores (2, 5, 8, 11 bases) correlated least significantly with the probe
hybridization intensity among the PDFs tested. No obvious differences were observed among the
scores obtained for 2, 5, 8 and 11 bases when correlating them with probe hybridization intensity
(Table 2).

Across all four data sets, PPBE, PMFE, dimer score, and hairpin score showed positive correlation
with probe hybridization intensity, and are therefore the more reliable indicators of probe
sensitivity. The other PDFs displayed inconsistencies in correlation for different data sets. For
example, PHFE is positively correlated with probe intensity in data sets 2 and 3, but is negatively
correlated with probe intensity in data sets 1 and 4. More complex models might be developed for
blast score and complexity scores (2, 5, 8, 11 bases), but that is beyond the scope of this paper.

As shown in Supplementary Table 2, enormous variations were observed among individual data
sets for the trend coefficients (e.g., intercept and slope), possibly due to differences in array
manufacture, sample and array processing, and other factors.

The values of PHFE and PMFE are dependent on parameters such as hybridization temperature and
sodium concentration, most of which were unavailable to us. However, we computed PHFE and
PMFE using various potential parameters, and changes in parameters did not cause significant
differences in correlation assessments: the average difference in ARSS value is 0.0058 (0.010 for
PHFE and 0.001 for PMFE) among the different temperature settings. 60°C was used for the PHFE
computation presented and 40°C was used for the PMFE computation presented for all data sets
because they slightly outperformed other temperatures.

Multi-variate Linear Modeling


For each data set, a multi-variate linear model with PPBE (W. PPBE model) was built based on
all PDFs for predicting probe hybridization intensity and comparing the significance of the
individual PDFs. This multi-variate model showed significant improvement over all individual
models based on each individual PDF (note the significantly diminished ARSS values in Figure 3 and
Supplementary Figure 3). The W. PPBE model parameters are in Supplementary Table 5.

Increasing the number of free parameters obviously improves the fit. On the other hand, overfitting
is very likely to happen and reduces or destroys the ability of the model to generalize beyond the
data it is built upon. The Akaike information criterion (AIC) is an operational way of trading off
the complexity of an estimated model against how well the model fits the data [32]. It not only
rewards improvement of fit, but also includes a penalty that is an increasing function of the
number of estimated parameters and thereby discourages over-fitting. In this study, stepwise
selection with AIC was used to search for the optimal model that contains only covariates
(individual PDFs) related to the outcome (probe hybridization intensity). Stepwise model selection
analysis showed that all PDFs contributed to the prediction of probe hybridization intensity in all
data sets, with only one exception in which the complexity score (2 bases) was not significant in
data set 1 (Supplementary Figure 4). The most significant factor is PPBE, followed by PMFE in all
data sets. The order of significance of other PDFs may vary among different data sets.

Generality  of  Linear  Models 

Two multi-variate models, the W. PPBE model (including all factors) and the W/O PPBE model
(including all factors except PPBE), were developed using a training data set and tested on
independent data sets to determine if the models can be reliably used as a probe design tool.

Applying within-dataset validation, Figure 4 illustrates that the models developed from the training
set can predict the performance of oligos in the test set almost as accurately as they can predict
performance in the training set. The W. PPBE model outperformed the W/O PPBE model in all cases,
suggesting that PPBE is a reliable factor although it is generated by an empirical approach.

Cross-dataset validations (Supplementary Table 6) resulted in extremely high ARSS values in
the test data sets when the W/O PPBE and W. PPBE models were applied, even if the array
manufacture technique and sample species were identical between test and training set. The complex
multi-variate models developed from one data set therefore cannot be directly applied
to other data sets. The adverse performance was not caused by PPBE, as there were no obvious
differences between W/O PPBE and W. PPBE models. The substantial variations in correlation
intercepts and slopes for each individual PDF, as observed in Supplementary Table 2, severely
hinder the cross-dataset probe intensity predictions using multi-variate linear models.

Probe  Specificity 

Probe specificity is a measurement of the capability of a probe to discriminate its specific
target sequences from a complex set of non-specific sequences. In a two-channel hybridization
experiment, if one channel includes the target sequence and the other does not, then a
probe with specificity for the target can be expected to yield a high ratio of hybridization signal
intensity between the two channels, which is a measure of probe specificity in the mixture.

We estimated the oligonucleotide specificity using Data Set 3, where the targets in one channel
included a three-fold over-representation of approximately half of the Salmonella genome and a
three-fold under-representation of the other half of the genome. Therefore, there are three-fold
differences in the target concentration between the two channels for all probes, and the expected
hybridization ratio is 3 for specific hybridization. This was achieved by XbaI-digestion of
stationary phase Salmonella enterica sv Typhimurium LT2 genomic DNA, separation of the seven
fragments using pulsed field gel electrophoresis, capturing those fragments and pooling the six
smaller fragments, while keeping the big fragment separate. Genomic DNA preparations from
stationary phase LT2 were then supplemented either with the big fragment, or with the pooled six
smaller fragments, creating overrepresentations of the different halves of the genome.

Probes with stronger hybridization intensities displayed better specificity (Figure 5A). When each
individual PDF and the predicted probe hybridization intensities were compared with the observed
ratios, significant correlation was detected between the ratios and all the factors (Supplementary
Figure 6), most significantly for PHFE, PMFE, PPBE and Complexity Score (8 bases). The Pearson
correlation coefficients are listed in Supplementary Table 7. It is interesting to note that PHFE
is significantly and positively correlated with probe specificity. Probes with low PHFE values
displayed both low specificity and relatively low sensitivity (as shown in Supplementary Figure 1-2).

As shown in Supplementary Figure 5, the relationships between log2-based ratios and some
PDFs seem to be non-linear. For the sake of simplicity, only linear equations were considered in
the current study.

Probe  Reproducibility 

Data set 4, which includes 205 replicated hybridizations, was used to estimate probe
reproducibility using the coefficient of variation (cv). High probe reproducibility (corresponding
to low cv values) is positively correlated with the observed probe hybridization intensities
(Figure 5B). When examined individually, each PDF shows a significant but distinct level of
association with cv (Supplementary Figure 6). PPBE and PHFE are the most significant factors.
Correlation coefficients are listed in Supplementary Table 7. Note that only linear equations were
considered for this reproducibility survey.

Software 

Programs for computing PHFE, PMFE, probe dimer score, hairpin score, blast score and
complexity score were written in Python. All programs, including parameters for computation, are
freely available upon request.


Discussion 

Microarray probe hybridization signals are determined by the equilibrium of probe-target complex
formation and probe-probe hybridization capability, and are also influenced by non-specific
binding from the complex target. The probe design factors (PDFs) we studied here covered these
three aspects.

Although Affymetrix Chips are designed for one-sample-for-one-array, it is very common to apply
multiple samples on a single array from customized platforms, including in-house spotted arrays
and many Nimblegen arrays, and we took advantage of this fact. The natural log transformed
intensity values from multiple arrays were averaged for each probe to minimize variation caused by
sample processing and hybridization. One advantage of our datasets for comparing probe
performance is that genomic DNA samples have targets at the same or similar concentrations,
allowing a comparison of probe performance under similar target concentrations.

Linear models were selected to model the relationships between individual PDFs and probe
performance based on our observation that most scatter plots generated from multiple data sets
consistently showed a linear relationship. The actual relationships may be far more complex;
nevertheless, from a practical point of view, linear models are easy to handle and generate more
accurate predictions based on model diagnosis with ARSS than more complex models [34]. The finding
of these correlations is a useful first step in trying to understand the physical phenomena, which
are clearly not subsumed in all the parameters currently in use. In future research, we plan to
identify more advanced models (for example non-linear association models) which may reduce the ARSS
we have achieved in the current study.

Probe minimum folding energy (PMFE), dimer score and hairpin score were the factors used to
estimate the probe-probe hybridization capability. Of all the traditional PDFs (all factors except
PPBE), PMFE correlates most significantly with probe hybridization intensity in all four data sets,
followed by dimer score and hairpin score in most data sets. Although these three PDFs contain
redundant information for estimation of the probe-probe hybridization capabilities, they cannot
simply be replaced by each other, as shown in the stepAIC analysis, which optimizes the complexity
of the model versus the fit [32]. All three PDFs therefore deliver unique information that needs to
be considered for probe design.

Probe  hybridization  free  energy  (PHFE)  is  a  long-established  parameter  for  measuring  probe-target 
hybridization  capability  in  solution.  In  our  study,  PHFE  was  not  as  reliable  in  predicting  probe 
hybridization  intensity  as  other  factors  (PMFE,  dimer  score  and  hairpin  score),  which  may  be 
largely  due  to  the  linkage  of  probes  to  a  solid  phase  in  microarray  hybridization.  To  compensate  for 
the  effect  of  one  end  of  the  probe  being  attached  to  the  matrix,  we  introduced  PPBE  which  modifies 
the  PHFE  calculation  by  adding  a  positional  weight  parameter  and  iteratively  fitting  positional 
weights  and  di-nucleotide  stacking  energies.  PPBE  showed  much  better  capabilities  of  predicting 
probe  hybridization  than  all  other  PDFs  and  tremendous  improvement  over  PHFE.  The  drawback 
of  PPBE  is  that  it  is  platform-dependent  and  preliminary  aCGH  data  is  required  for  developing  the 
PPBE  model  prior  to  application.  The  quality  of  the  training  data  is  critical  for  the  construction  of 
an accurate PPBE model. There are many factors that may result in poor-quality arrays, such as
poor sample quality or poor hybridization. To address these problems, we suggest that CGH be
performed using normal genomes without copy number variation; multiple hybridizations with each of
the dyes to be used would be desirable to minimize the noise caused by sample processing.

Both PMFE and PHFE are sodium-dependent. Generally, changes in free energy are linearly
correlated with log-transformed sodium concentration [17], which we confirmed on the Mfold
web server [18] for PMFE and PHFE. That means all the oligonucleotide PMFE/PHFE values will
change in the same proportion if the sodium concentration changes. Subsequently, these changes
will be cancelled out by adjustment of the related coefficients in linear models. Therefore,
changes in sodium concentration had no influence on the significance of linear modeling.
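
The cancellation argument can be made explicit for a simple linear model. The following is a sketch under an assumption introduced here for illustration: a change in sodium concentration maps the energy value $x$ of every probe to $x' = c\,x + d$, with the same constants $c \neq 0$ and $d$ for all probes. The symbols $c$, $d$, $\beta_0$ and $\beta_1$ do not appear in the original analysis.

```latex
\begin{align*}
\hat{g} &= \beta_0 + \beta_1 x, \qquad x' = c\,x + d \quad (c \neq 0) \\
\hat{g} &= \beta_0 + \beta_1 \frac{x' - d}{c}
         = \underbrace{\left(\beta_0 - \frac{\beta_1 d}{c}\right)}_{\beta_0'}
           + \underbrace{\frac{\beta_1}{c}}_{\beta_1'} \, x'
\end{align*}
```

Every fit available with $x$ is therefore also available with $x'$, so the least-squares predictions, the residuals and the ARSS are identical, and the correlation coefficient changes at most in sign (when $c < 0$).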

The PPBE model is empirical by nature, similar to the positional-dependent-nearest-neighbor
(PDNN) model which was designed for the Affymetrix array platform [34], whose parameters
similarly need to be empirically estimated based on hybridization data and significantly vary among
different Affymetrix array platforms. At this stage, we do not understand the physical properties
governing the parameters, but present a practical approach to optimizing oligo design.

The position-dependence of the weighting factors is a conspicuous feature in such models. In
previous work, the sensitivity profiles of base C and base A change in a parabola-like fashion along
the 25-base probe sequence, while the same terms for G and T change monotonically [35-38]. The
overall position weighting factors change roughly as the curvature of a parabola, with peak and
shape varying across different GeneChip platforms [14,34,39]. Our data reveal weight distribution
patterns different from this previous work. Our data were obtained on two types of platforms:
Nimblegen in situ synthesized oligonucleotide arrays and a spotted oligonucleotide array. For the
three Nimblegen platforms, the weights change linearly for the first 35 to 45 bases or so from the
3’ end and get weaker at the free end (Figure 1B, Supplementary Figure 2B & 2E). In contrast, a
parabola-like curve is observed on the other platform (Supplementary Figure 2H). Although it is
not the object of this article to explore a physical explanation for these differences, we point out
some facts that may be important in further studies:

•  We are using platforms of 50-mer probes, while the quoted previous work used 25-mer Affymetrix
GeneChip platforms. Lengthening of the sequence on the platform inevitably reduces the
importance of each single base or position, and weakens the position-dependence.

•  Unlike Affymetrix platforms and Nimblegen platforms, the probes of the spotted array in this
study are linked to the array at the 5’ end, and there are no terminal oligonucleotide linkers
between probes and the array surface. The impact of this difference is unknown, but it may
reduce the freedom of a probe and even its effective length, leading to a pattern of
position-dependence similar to that of platforms with shorter probes, e.g. Affymetrix platforms.

For the fitting of the PPBE model, it is not critical whether weights or energies were fitted first.
Either way, the final converged models reach similar ARSS values; the average difference is less
than 0.005 in ARSS value. The final weights and pseudo stacking energies are similar as well.
We began to fit the models with the conventional di-nucleotide stacking energies simply because
the models reached convergence faster. It is possible that the di-nucleotide stacking
energies express a relevant part of the physical properties underlying the model; however,
further evidence is required to confirm this speculation.

Blast and complexity scores reflect occurrences of sequence segments similar to the probe, and are
used for evaluating probe specificity. It would be simpler and easier to use cut-off thresholds for
these PDFs to filter out bad quality probes. In this study we applied four different patterns for the
complexity score calculation, which are based on 2, 5, 8 and 11 base patterns. The complexity score
(8 bases) showed better correlation with probe specificity than other complexity score patterns and
blast score.

Langmuir isotherm oriented models were not included in our studies. Although the Langmuir model
was initially developed for adsorption of gases on glass surfaces [40], its variations have been
widely applied in research on hybridization of oligonucleotides on DNA microarrays [13-16,41].
In these models, the hybridization signal intensities are in essence divided into two parts: the
hybridization of the probe with its perfect-matching target and the background noise. Although such
models fit hybridization intensity values well for spike-in genes and corresponding targets with
controlled concentrations, they are of less help in screening probes for microarray design because
they are based on the equilibrium constant, or equivalently, the change of standard Gibbs free
energy ΔG°, which is a PDF of less sensitivity and specificity in comparison to PMFE and PPBE in
our study. In contrast, platform-dependent empirical models based on pseudo free energies and
position weights can make predictions very close to the observed hybridization intensities [34,39].
This fact encouraged us to explore pure empirical models in microarray design.
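
For orientation, the two-part structure of a Langmuir-type intensity model can be written in its
generic textbook form, as sketched here. This is not the specific formulation used in the cited
references; the symbols I (observed intensity), B (background), A (saturation amplitude), c (target
concentration) and K (equilibrium constant) are labels introduced only for this illustration.

```latex
I = B + A\,\frac{K c}{1 + K c},
\qquad
K = \exp\!\left(-\frac{\Delta G^{\circ}}{R T}\right)
```

The second relation is the standard link between the equilibrium constant and the change of
standard Gibbs free energy, which is why a model of this kind depends on the probe only through ΔG°.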


In  summary,  we  used  aCGH  as  a  model  system  to  study  the  correlation  between  individual  PDFs 
and  probe  performance  during  microarray  hybridization.  These  individual  correlations  can  be  used 
as  guidance  for  designing  microarray  probes  for  other  complex  experimental  setups  such  as  gene 
expression analysis. In gene expression microarray hybridization, non-specific binding,
probe-target complex formation and probe-probe binding capability will all be influenced by the varying
concentrations  of  the  targets.  Systematic  study  of  probe  performance  in  such  systems  is  beyond 
the  scope  of  this  study. 

If preliminary aCGH data are available, a complex multi-variate linear model including the factor
PPBE can be developed and used for refining arrays. The model predicts a probe hybridization
intensity value, which serves as an indicator of probe quality. Higher predicted intensity values
correspond to higher sensitivity, improved specificity and reproducibility. In practice, this
strategy can be used for improving an existing array platform by replacing bad probes or by
expanding the array by selecting probes predicted to perform well.
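
The refinement strategy above can be sketched as a small ordinary least squares fit. This is a
minimal illustration under stated assumptions, not the authors' program: the function names are
hypothetical, the PDF values are made up, and the fit uses plain normal equations without any
handling of collinear factors.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_intensity_model(pdf_rows, intensities):
    """Least squares fit of: intensity ~ intercept + sum(slope_k * PDF_k)."""
    x = [[1.0] + [float(v) for v in row] for row in pdf_rows]
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(x, intensities)) for i in range(k)]
    return solve_linear_system(xtx, xty)  # [intercept, slope_1, ..., slope_k]


def predict_intensity(coefficients, pdf_row):
    """Predicted hybridization intensity for one probe's PDF values."""
    return coefficients[0] + sum(c * v for c, v in zip(coefficients[1:], pdf_row))


def rank_by_predicted_intensity(coefficients, candidates):
    """candidates maps probe name -> PDF values; highest prediction first."""
    return sorted(
        candidates,
        key=lambda name: predict_intensity(coefficients, candidates[name]),
        reverse=True,
    )


# Hypothetical training data: two PDFs per probe and observed intensities.
training_pdfs = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
training_intensity = [2.0, 5.0, 1.0, 4.0, 7.0]
model = fit_intensity_model(training_pdfs, training_intensity)
```

Because the coefficients vary between data sets and platforms, a model fitted this way should only
be reused on the platform that supplied the training data.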

If  aCGH  data  are  unavailable  for  microarray  platform  design,  we  suggest  using  each  individual 
PDF  to  filter  or  rank  probes  instead  of  using  a  complex  model,  because  the  coefficient  parameters 
(intercept  and  slopes)  vary  significantly  among  different  data  sets/platforms.  PMFE,  hairpin  score 
and  probe  dimer  score  can  be  used  to  rank  probe  qualities.  PHFE,  blast  score  and  complexity 
score  can  be  used  to  filter  probes  with  low  specificity.  We  have  provided  all  correlation  parameters 
generated  from  four  data  sets  to  be  used  as  a  guideline  for  filtering  or  ranking  probes.  All  the 
programs  for  calculating  individual  PDFs  are  also  available  from  the  authors. 
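
The filter-then-rank suggestion can be sketched as follows. This is an illustration only: the
`Probe` fields, the cut-off values and the direction of each comparison are assumptions made for
the example, and in practice they would be taken from the correlation parameters of the relevant
platform.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    name: str
    pmfe: float              # probe minimum folding energy
    blast_score: float       # similarity of the probe to other sequence segments
    complexity_score: float  # occurrence of short base patterns


def filter_probes(probes, max_blast, max_complexity):
    """Drop probes whose specificity-related PDFs exceed the chosen cut-offs."""
    return [
        p for p in probes
        if p.blast_score <= max_blast and p.complexity_score <= max_complexity
    ]


def rank_probes(probes, key, higher_is_better=True):
    """Order probes by a single PDF; the caller decides which direction is better."""
    return sorted(probes, key=key, reverse=higher_is_better)


# Hypothetical probes and cut-offs.
candidates = [
    Probe("p1", pmfe=-1.0, blast_score=10, complexity_score=5),
    Probe("p2", pmfe=-5.0, blast_score=50, complexity_score=5),
    Probe("p3", pmfe=-0.5, blast_score=12, complexity_score=4),
]
kept = filter_probes(candidates, max_blast=20, max_complexity=6)
ranked = rank_probes(kept, key=lambda p: p.pmfe)
```

Keeping filtering and ranking as separate steps mirrors the text: specificity-related PDFs remove
probes outright, while the remaining PDFs only order the survivors.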
Fig. 1. ARSS, positional weights, pseudo stacking energies of PPBE model for data set 1.

A. Convergence of the PPBE model after three cycles of iterative fitting of each of positional weights and pseudo di-nucleotide stacking energies
(six cycles total); X axis is the number of iterative fittings. B. Plot of positional weights; X axis is the nucleotide position. C. Comparison of
traditional di-nucleotide stacking energies and pseudo di-nucleotide stacking energies. Y axis is the pseudo di-nucleotide stacking energies;
X axis is the traditional (original) di-nucleotide stacking energies.
Fig. 2. Box plots show the correlation of individual probe design factors with observed oligonucleotide
probe hybridization intensities for data set 1.

Density curve (red line) is computed using kernel density estimates and shows the distribution of individual probe design factors. Y axis (left)
depicts probe hybridization intensity. Y axis (right) represents the density of different PDFs. X axes are: A. Probe hybridization free energy; B.
Probe minimum folding energy; C. Probe hairpin score; D. Probe dimer score; E. Complexity score (2 bases); F. Complexity score (5 bases); G.
Complexity score (8 bases); H. Complexity score (11 bases); I. Blast score; J. Pseudo probe binding energy.
Fig. 3. Relative ARSS of different models for different data sets.

Y axis is the ratio of each model's ARSS relative to the W. PPBE model's ARSS. The X axis is grouped into models based on individual probe
design factors (PDFs) and the multi-variate model. From left to right, the X-axis categories are PHFE, Complexity Score (2 bases), Complexity
Score (5 bases), Complexity Score (8 bases), Complexity Score (11 bases), blast score, dimer score, hairpin score, PMFE, PPBE, W. PPBE model.


Fig. 4. Comparisons of ARSS for within-dataset validations using W/O PPBE model or W. PPBE model.

Panels, from left to right: Data Set 1, Data Set 2, Data Set 3, Data Set 4. Y axis is the ARSS value. Blue bars show the ARSS value for the
training set (half of the whole data set). Brown bars show the ARSS value for the test set (half of the whole data set).
Fig. 5. Correlation of probe hybridization intensity with probe specificity and reproducibility.

A. Correlation of probe hybridization intensity with probe specificity (observed log base 2 transformed ratio). Grey line shows where there is no
change; B. Correlation of oligonucleotide probe hybridization intensity with probe reproducibility, represented as coefficient of variation (cv).
Table 1. Array CGH data sets used in this study.

Data Set | Microarray Platform | Sample | Manufacturer | Designer | Oligos | Bases | Role of data set in the analysis | Sample number
1 | NimbleGen HG18 whole genome CGH Array | Normal human male genomic DNA | NimbleGen Inc. | NimbleGen Inc. | 1372S0 | 50 | Sensitivity | 6
2 | NimbleGen Human Promoter Array (custom design) | Human prostate cell line (PC3M, 267B1) genomic DNA | NimbleGen Inc. | authors | 220475 | 50 | Sensitivity | 4
3 | NimbleGen Salmonella Whole Genome Array (custom design) | Salmonella LT2 genomic DNA | NimbleGen Inc. | authors | 2S8238 | 50 | Sensitivity, specificity | 4
4 | In-house Spotted Human Promoter Array (custom design) | Normal human lung tissue genomic DNA | authors | authors | 11653 | 50 | Sensitivity, reproducibility | 205

Table 2. Simple linear model average residue square sum (ARSS) and correlation coefficients (r) for the correlation
of individual probe design factors (PDFs) with probe hybridization intensities.

PDF | Data Set 1 r | Data Set 1 ARSS | Data Set 2 r | Data Set 2 ARSS | Data Set 3 r | Data Set 3 ARSS | Data Set 4 r | Data Set 4 ARSS
PHFE | 0.11 | 0.168 | 0.03 | 0.504 | 0.03 | 0.460 | 0.13 | 1.668
PMFE | 0.29 | 0.156 | 0.27 | 0.468 | 0.32 | 0.414 | 0.28 | 1.568
HairpinScore | 0.21 | 0.162 | 0.22 | 0.479 | 0.20 | 0.442 | 0.21 | 1.621
DimerScore | 0.19 | 0.164 | 0.23 | 0.478 | 0.17 | 0.448 | 0.15 | 1.660
ComplexityScore-2B | 0.08 | 0.169 | 0.05 | 0.503 | 0.02 | 0.461 | 0.09 | 1.684
ComplexityScore-5B | 0.04 | 0.170 | 0.11 | 0.498 | 0.01 | 0.461 | 0.02 | 1.698
ComplexityScore-8B | 0.01 | 0.170 | 0.15 | 0.493 | 0.01 | 0.461 | 0.12 | 1.675
ComplexityScore-11B | 0.01 | 0.170 | 0.10 | 0.498 | 0.02 | 0.461 | 0.10 | 1.683
BlastScore | 0.02 | 0.170 | 0.11 | 0.498 | 0.01 | 0.461 | 0.18 | 1.641
PPBE | 0.36 | 0.148 | 0.30 | 0.460 | 0.65 | 0.269 | 0.48 | 1.301
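
The two statistics reported per PDF and data set can be computed as in the following sketch. It
assumes that ARSS is the residual sum of squares of the fitted simple linear model divided by the
number of probes; that reading of "average residue square sum" is an assumption, and the function
name is hypothetical.

```python
import math


def simple_linear_fit(x, y):
    """Fit y ~ intercept + slope * x and return (slope, intercept, r, arss)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)  # Pearson correlation coefficient
    residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
    arss = sum(e * e for e in residuals) / n  # assumed definition of ARSS
    return slope, intercept, r, arss


# Hypothetical PDF values (x) and hybridization intensities (y).
slope, intercept, r, arss = simple_linear_fit([1, 2, 3, 4], [2, 4, 6, 8])
```

Under this definition a lower ARSS and a larger absolute r both indicate that the PDF explains
more of the variation in hybridization intensity.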
INTRODUCTION

WebArray  is  a  Web  platform  for  microarray  data  analysis.  As  an  analysis  suite  designed  by  bench 
biologists,  WebArray  is  user-friendly  for  life  scientists  without  a  bioinformatics  background.  It  is 
simple  to  use  but  employs  powerful  analysis  functions.  Analysis  is  based  on  files  uploaded  by  users. 
For Affymetrix GeneChip data, intensity files in CEL format can be used. For two-color experiments,
WebArray  can  recognize  intensity  files  generated  from  many  different  software  packages.  WebArray 
provides  functions  for  data  quality  control,  background  correction,  normalization,  differential  analysis, 
and  plotting  on  a  genome  map.  A  user-friendly  aspect  of  WebArray  is  the  fact  that  users  generally  do 
not  have  to  change  the  default  parameters  for  common  experimental  designs,  so  they  are  usually 
protected  from  applying  the  wrong  statistical  tools.  In  most  cases,  novice  users  will  have  no  problem 
finding  explanations  for  file  formats  or  terms  in  the  extensive  help  system. 

RELATED  INFORMATION 

Supported Web browsers include Mozilla Firefox (recommended), Microsoft Internet Explorer, Opera,
Flock, and Google Chrome. On the WebArray Web page, the browser window is divided into three
sections: WebArray's flag is on the top panel, the left panel contains the function menu, and the rest is
the work area. Generally, four steps are required to perform a new data analysis: (1) register and log on,
(2) upload files, (3) select options for analysis and submit requests, and (4) browse/download results.

WebArray recognizes intensity files from many different sources, including the Affymetrix, Agilent,
ArrayVision, GenePix, ImaGene, QuantArray, SMD, and SPOT software packages as well as any
variable user-defined format. Only the intensity files are mandatory. Other files accepted by WebArray
include the following:

•  gene list file: contains a list of gene IDs and associated gene information

•  target file: contains information about the samples associated with every microarray

•  design file: delineates a design matrix for linear model analysis

•  spot type file: identifies the different types of spots from the gene list

•  genome/chromosome location file: a list of genes with information about their locations on the
chromosome/genome

•  composite normalization file: contains a sub-list of spots expected to be invariant between
control and experiment, to be used for normalization of data between channels

Detailed  descriptions  can  be  accessed  simply  by  clicking  on  the  respective  file-type  term  in  the  work 
space. 


WebArray  (http://www.webarray.org)  was  originally  described  by  Xia  et  al.  (2005). 

METHOD 

Registration  and  Logon 

Although a guest account with full functions can be used by visitors, we encourage users to create a
private account for data security. After the user submits the registration information, a confirmation
message is sent to the user’s e-mail address. The user account is activated immediately after the user
responds to this message. Registered users can log on to WebArray with their user name and password.
Passwords are encrypted for security.

1. To register:

i. Enter “http://www.webarray.org” in the address bar of the Web browser to enter WebArray’s Web
site.

ii.  Click  on  the  “Register”  button  in  the  function  menu  to  enter  the  registration  page. 

iii.  Enter  required  and  (if  desired)  optional  information,  then  click  on  the  “Register”  button. 

iv.  Check  your  e-mail  box  and  follow  directions  in  the  registration  confirmation  message  from 
WebArray  to  activate  your  account. 

2.  To  log  on: 

i.  Enter  “http://www.webarray.org”  in  the  address  bar  of  the  Web  browser  to  enter  WebArray’s  Web 
site. 

ii.  Enter  user  name/password  and  click  on  the  “Sign  In”  button  in  the  function  menu. 

iii.  Click  on  the  “WebArray”  link  in  the  function  menu. 

Note:  The  “WebArrayDB”  link  in  the  same  window  will  take  you  to  WebArrayDB,  a  database  and 
cross-platform  analysis  package  which  will  be  published  separately  and  is  not  part  of  this  protocol. 

File  Management  (Upload  and  Delete) 

Uploaded files are stored and visible in the user’s private folders. To save space on the server, users are
encouraged  to  delete  their  files  after  all  analyses  have  been  carried  out.  If  desired,  WebArrayDB  can 
be  used  for  long-term  storage  of  data  in  MIAME  compliant  formats. 

3.  To  upload  files: 

i.  Click  on  the  “Upload”  link  in  the  menu. 

ii.  Choose/add  files  in  the  work  area  by  clicking  on  the  “Browse”  button  and  selecting  the  respective 
files  from  your  computer/network. 

iii.  Click  on  the  “Upload”  button  on  top  or  bottom  of  the  work  area. 

JMaster’s Java applet, “JumpLoader,” has been integrated into WebArray as an alternative method for
uploading files. Clicking on the “Try JumpLoader” button will open a file manager-like window that
allows users to select local files by drag and drop. After all files are selected, click the “Start
Upload” link. Unlike conventional HTML forms, the uploading session does not time out, but make
sure not to close the window before all the files have uploaded successfully.

4.  To  delete  files: 

i.  Click  on  the  “Browse/Delete”  link  in  the  menu. 

ii.  Choose  files  to  be  deleted  by  clicking  on  the  check  box  behind  each  file  name. 

iii.  Click  on  the  “Delete  checked  files”  button. 


Data  Analysis 

Users  can  analyze  either  Affymetrix  GeneChip  data  or  dual-channel  data  using  WebArray.  There  are 
two  separate  dialogue  frames  on  WebArray  to  deal  with  these  two  types  of  data.  Both  frames  have  four 
sections  in  the  following  order:  (1)  Experiment  design,  (2)  Parameters  for  analysis,  (3)  Output  options, 
and  (4)  Request  name. 

5.  To  perform  data  analysis: 

i.  Click  on  either  the  “Affymetrix”  or  the  “Two-Color”  link  in  the  menu.  A  frame  for  data  analysis  will 
appear  in  the  work  area. 

ii.  Define  the  experimental  design  in  the  first  section  by  selecting  intensity  files  and  defining  which 
sample  group  each  sample  belongs  to. 

For Affymetrix GeneChip data, Affymetrix GeneChip CEL files (usually with “.CEL” or “.cel” as
extensions of the file names) are used as intensity files. Each sample can be defined as “exp1,” “exp2,”
“exp3,” or “exp4.”

For two-color data, users have to specify the correct format for the intensity files (a choice of nine
different formats, including Agilent, ArrayVision, GenePix, ImaGene, QuantArray, and SPOT).
Channels on the arrays can subsequently be defined as “ref,” “ctrl,” and “exp.” Note that a gene list
file, or both a target file and a design file, need to be specified to enable analysis.

Important: For any experiment regardless of platform, at least two different sample groups (such as ref,
ctrl, exp1, exp2, etc.) need to be present and each group must include intensity data from at least two
arrays; otherwise, statistical analysis will not be performed.

iii. For Affymetrix data, enter the desired comparisons. For example, “exp2-exp1; exp3-exp2” will
compare (1) the difference between “exp2” and “exp1” and (2) the difference between “exp3” and
“exp2.” The analysis result output file will report the log2 of the ratios (i.e., exp2/exp1 and exp3/exp2)
for each comparison.

iv.  The  second  and  third  sections  of  the  frame  contain  options  for  analysis  and  result  output.  The  main 
functions that WebArray can perform include background subtraction, within-array normalization,
between-array normalization, and differential statistical analysis. The default analysis parameters are
suitable  for  the  most  commonly  used  experiment  designs.  In  most  cases,  users  can  analyze  their  data 
without  changing  the  settings,  although  more  sophisticated  users  are  free  to  select  from  any  of  the 
optional  parameters  to  suit  their  specific  requirements.  Each  analysis  operation  is  hot-linked  to  a  help 
file  explaining  the  operation  and  different  options  in  more  detail. 

v.  In  the  last  section,  provide  a  name  for  the  data  analysis  request. 

vi.  Click  on  the  “Submit  Analysis  Request”  button.  The  user  will  automatically  be  taken  to  a  frame  that 
displays  all  analysis  requests  submitted  by  that  user. 
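
The design rule stated under "Important" and the comparison syntax of step iii can be illustrated
with a short sketch. This is not WebArray's own code: the function names are hypothetical, and the
group means in the usage lines are made-up numbers on the raw intensity scale.

```python
import math
from collections import Counter


def design_is_valid(group_of_array):
    """At least two sample groups, each with intensity data from at least two arrays."""
    counts = Counter(group_of_array.values())
    return len(counts) >= 2 and all(n >= 2 for n in counts.values())


def parse_comparisons(text):
    """Turn 'exp2-exp1; exp3-exp2' into [('exp2', 'exp1'), ('exp3', 'exp2')]."""
    pairs = []
    for item in text.split(";"):
        item = item.strip()
        if not item:
            continue
        left, sep, right = item.partition("-")
        if not sep or not left.strip() or not right.strip():
            raise ValueError(f"malformed comparison: {item!r}")
        pairs.append((left.strip(), right.strip()))
    return pairs


def log2_ratio(numerator, denominator):
    """log2 of the ratio reported for one comparison, e.g. exp2/exp1."""
    return math.log2(numerator / denominator)


# Hypothetical assignment of CEL files to sample groups.
design = {"a.CEL": "exp1", "b.CEL": "exp1", "c.CEL": "exp2", "d.CEL": "exp2"}
if design_is_valid(design):
    for left, right in parse_comparisons("exp2-exp1"):
        print(left, "vs", right, log2_ratio(8.0, 2.0))
```

Checking the design before submitting a request avoids queuing a job for which no statistical
analysis would be performed.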

Browsing  Results 

Submitted requests will be put in the job queue on the server. Completing a request takes a few minutes
or (occasionally) hours, depending on the complexity of the analysis and the load on the server.
Users do not have to wait for a request to be completed: they can close their Web browsers and
return later. Results are presented in charts and tables for downloading or browsing online.

6.  To  browse  results: 

i.  Follow  the  “Results”  link  in  the  menu.  All  submitted  requests  will  be  listed  in  the  work  area. 

For every request, there are two links: “Browse” and “Edit.” The latter brings the user to the analysis
page, which facilitates changing of parameters and re-submission of jobs.


ii.  Click  on  the  “Browse”  link.  The  work  area  will  be  redirected  to  a  frame  with  all  charts  initially 
requested  by  the  user  and  links  to  result  tables.  A  link  is  offered  for  downloading  a  zip-compressed 
package  of  all  results  for  that  specific  analysis  request.  Alternatively,  users  can  choose  to  only  view  or 
download  the  result  table,  or  the  input  parameters  for  that  analysis  request. 

iii.  If  the  user  decides  to  view  the  result  table,  this  table  will  be  displayed.  The  table  can  be  sorted  in 
ascending  or  descending  order  for  any  of  the  column  headers,  including  p  value. 

iv.  The  output  data  file  will  contain  the  following  columns: 

Columns  “Block,”  “Row,”  “Column,”  “ID,”  and  “Name”  list  the  same  information  as  in  the 
corresponding  columns  in  the  gene  list  file. 

“M”  is  the  log-differential  expression  ratio. 

“A”  is  the  log-intensity  of  the  spot,  a  measure  of  overall  brightness  of  the  spot. 

“t”  is  the  penalized  t-statistic  value. 

“p” is the p-value corresponding to the t-statistic.

“B”  is  the  B  statistic;  the  log-odds  of  differential  expression. 

“fdr”  is  the  estimated  false  discovery  rate  incurred  by  setting  the  threshold  at  the 
corresponding  p  value. 

“fp”  is  the  estimated  number  of  false  positives  incurred  by  setting  the  threshold  at  the 
corresponding  p  value. 

“fn” is the estimated number of false negatives incurred by setting the threshold at the
corresponding p value.

“M,” “A,” “t,” “p,” and “B” are calculated with linear model statistical analysis (Smyth 2004). “fdr,”
“fp,” and “fn” are estimated with SPLOSH (Pounds and Cheng 2004). Detailed information can be
found in the WebArray help documents.
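
The meaning of the “M” and “A” columns for a two-color spot, and a simple way to select rows from
a downloaded result table, can be sketched as below. The M and A formulas are the usual log2
definitions for red and green channel intensities; the tab-delimited layout of the table and the
sample rows are assumptions made for this example and should be checked against an actual download.

```python
import csv
import io
import math


def m_and_a(red, green):
    """M is the log2 ratio of the two channels; A is their average log2 intensity."""
    m = math.log2(red) - math.log2(green)
    a = 0.5 * (math.log2(red) + math.log2(green))
    return m, a


def rows_below_fdr(table_text, max_fdr=0.05):
    """Return rows of a tab-delimited result table whose 'fdr' is at most max_fdr."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    return [row for row in reader if float(row["fdr"]) <= max_fdr]


# Hypothetical excerpt of a result table (column names taken from the list above).
sample_table = (
    "ID\tM\tA\tt\tp\tB\tfdr\n"
    "gene1\t2.0\t9.1\t6.3\t0.0001\t4.2\t0.01\n"
    "gene2\t0.1\t8.0\t0.4\t0.7\t-5.0\t0.80\n"
)
selected = rows_below_fdr(sample_table, max_fdr=0.05)
```

Filtering on “fdr” rather than on the raw p value keeps the expected proportion of false positives
among the selected genes under the chosen limit.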

DISCUSSION 

WebArray  presents  a  simple  interface  for  biologists  to  analyze  microarray  data.  WebArray  integrates 
functions  of  the  LIMMA  package  for  background  correction,  data  normalization,  and  statistical 
analysis.  More  details  about  LIMMA  can  be  found  in  the  help  documents  of  WebArray  or  in  the 
literature  (Smyth  and  Speed  2003;  Smyth  et  al.  2005).  The  “affy”  package  (Gautier  et  al.  2004)  is 
adopted  for  reading  Affymetrix  CEL  files  and  normalizing  Affymetrix  gene  expression  data.  Another 
independent  normalization  method,  which  is  based  on  principal  component  analysis  (PCA),  was  also 
included  in  WebArray  (Stoyanova  et  al.  2004).  The  underlying  algorithm  for  differential  analysis  is  an 
eBayes-moderated t-test implemented in the LIMMA package (Smyth 2004), which is commonly used
for  conventional  data  from  fairly  simple  experimental  designs. 
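
The idea behind the moderated t-statistic can be sketched in a few lines. This is an illustration
of the published formula (Smyth 2004), not the LIMMA implementation: in LIMMA the prior degrees of
freedom and prior variance are estimated from all genes, whereas here they are simply passed in as
the hypothetical arguments `prior_df` and `prior_var`.

```python
import math


def moderated_t(effect, sample_var, residual_df, prior_var, prior_df, unscaled_var):
    """Moderated t: the gene's variance is shrunk toward a prior before testing.

    posterior_var = (prior_df * prior_var + residual_df * sample_var)
                    / (prior_df + residual_df)
    t = effect / sqrt(posterior_var * unscaled_var)
    """
    posterior_var = (prior_df * prior_var + residual_df * sample_var) / (
        prior_df + residual_df
    )
    return effect / math.sqrt(posterior_var * unscaled_var)


# With prior_df = 0 the statistic reduces to the ordinary t-statistic.
ordinary = moderated_t(3.0, 4.0, 5, prior_var=1.0, prior_df=0, unscaled_var=1.0)
shrunk = moderated_t(3.0, 4.0, 5, prior_var=1.0, prior_df=5, unscaled_var=1.0)
```

Shrinking the variance stabilizes the statistic for genes whose variance happens to be estimated
as very small or very large from only a few arrays.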

Other  excellent  peer  Web  services  for  microarray  data  analysis  include  SNOMAD  (Colantuoni  et  al. 
2002),  ArrayQuest  (Argraves  et  al.  2005),  and  GEPAS  (Tarraga  et  al.  2008).  However,  WebArray  has 
great advantages in simplicity and flexibility. The one-page analysis Web interface of WebArray makes
all  options  clear  and  easier  to  change  than  the  step-by-step  interfaces  in  other  software  packages.  A  user 
can  submit  multiple  analysis  requests  and  browse  the  results  later,  which  helps  to  save  users’  waiting 
time.  Moreover,  users  can  use  WebArray  just  for  data  normalization  or  the  integration  of  data  from 
separate  files. 


WebArray is designed to analyze data sets from a single array platform. For complex experiments
involving more than one array platform per analysis, a more sophisticated database and analysis tool,
WebArrayDB (http://www.webarraydb.org), has been deployed. Users are encouraged to first master
WebArray before advancing to WebArrayDB.
METHODS TO TREAT SOLID TUMORS


Related Patent Application(s)

This application claims the benefit of U.S. provisional patent application no. 61/061,576 filed on
June 13, 2008, entitled “Method to Treat Solid Tumors,” and designated by Attorney Docket number
655233000100. The entire content of the foregoing patent application is incorporated herein by
reference, including, without limitation, all text, tables and drawings.

Statement of Government Support

This invention was made in part with government support under Grant Nos. R01 AI034829, R01
AI052237, and R21 AI057733 awarded by the National Institutes of Health (NIH) and Grant No.
TRDRP 16KT-0045 to Sidney Kimmel Cancer Center from the Tobacco-Related Disease Research
Program of California and grants CA103563; CA119811 and DOD grant W81XWH-06-0117 to
AntiCancer. The government has certain rights in this invention.

Field  of  the  Invention 

The invention relates in part to compositions and methods to selectively target solid tumors. More
specifically, it concerns compositions comprising expression systems for cytotoxic proteins under
the control of promoters active in tumors.

Background 

A wide range of bacteria (e.g., Escherichia, Salmonella, Clostridium, Listeria, and Bifidobacterium)
have been shown to preferentially colonize solid tumors. Salmonella enterica and
avirulent derivatives may effect some degree of tumor reduction by the presence of the bacteria in
the solid tumor. The internal environment of solid tumors is not well understood and may present
favorable growing conditions to colonizing bacteria.

Summary 

The environment inside solid tumors is very different from that in normal, healthy tissue. Solid
tumors often are poorly vascularized and sometimes have areas of necrosis. The poor
vascularization contributes to hypoxic or anoxic areas that can extend to about 100 micrometers
from the vasculature of the solid tumor. Solid tumors also can have an internal pH lower than the
organism’s normal pH. Necrosis in solid tumors can lead to a nutrient rich environment where
bacteria capable of growing in low oxygen conditions can flourish. In addition to the nutrient rich
environment, the internal spaces of solid tumors also offer some degree of protection from a host
organism’s immune system, and thus shield the bacteria from the host’s immune response. These
conditions may cause bacteria to express genes that are not normally expressed in normal, healthy
tissues. These factors may contribute to the preferential colonization of solid tumors as compared
to other normal tissue.

The internal environment of tumors may offer regulatory conditions not well understood, in addition
to low oxygen and low pH. Promoters are nucleotide sequences that in part regulate the
production of mRNA from coding sequences in genomic DNA. The mRNA then can be translated
into a polypeptide having a particular biological activity. Bacterial promoters that are preferentially
activated in tumors have been identified by methods described herein, and compositions that
contain such promoters, and methods for using them, also are described.

Thus, provided herein are isolated nucleic acid molecules that comprise a recombinant expression
system, which expression system comprises a nucleotide sequence encoding a toxic or
therapeutic RNA (e.g., mRNA, tRNA, rRNA, siRNA, ribozyme, and the like), a protein or an RNA or
protein that participates in generating a toxin or therapeutic agent, or a nucleotide sequence
encoding a toxic or therapeutic agent, RNA or protein which can mobilize the subject’s immune
response, operably linked to a heterologous promoter which promoter is preferentially activated in
solid tumors. In certain embodiments, the heterologous promoter sequence can be a naturally
occurring promoter sequence. In some embodiments the promoter can be an Enterobacteriaceae
promoter, and in certain embodiments the promoter is a Salmonella promoter. In some
embodiments, the promoter may comprise (i) a nucleotide sequence of Table 2A, (ii) a functional
promoter nucleotide sequence 80% or more identical to a nucleotide sequence of Table 2A, or (iii)
a functional promoter subsequence of (i) or (ii). In certain embodiments, the functional promoter
subsequence is about 20 to about 150 nucleotides in length.
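
As an illustration of the "80% or more identical" criterion, percent identity between two
sequences that have already been aligned to equal length can be computed as sketched below. The
text does not specify an alignment method or how gaps are counted, so both choices here (a
pre-aligned input with "-" for gaps, and gap columns counted as mismatches) are assumptions, and
the function name is hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of aligned columns with identical bases.

    Inputs must be pre-aligned strings of equal length, with '-' marking gaps.
    Columns where both sequences have a gap are ignored; a gap opposite a base
    counts as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" and b == "-":
            continue
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return 100.0 * matches / compared


# Hypothetical 10-nucleotide example with two mismatches.
identity = percent_identity("ACGTACGTAC", "ACGTACGTTT")
meets_threshold = identity >= 80.0
```

In practice the alignment itself would come from a dedicated alignment tool, and the identity
value depends on the alignment parameters used.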

The term "preferentially activated in solid tumors" as used herein refers to a nucleotide sequence
that expresses a polypeptide from a coding sequence in tumors at a level of at least two-fold more
than the same polypeptide from the same coding sequence is expressed in non-tumor cells. The
polypeptide may be expressed at detectable levels in non-tumor cells or tissue in some
embodiments, and in certain embodiments, the polypeptide is not detectably expressed in
non-tumor cells or tissue. As an example, preferential activation can be determined using (i) cells from
the spleen as non-tumor cells and (ii) PC3 prostate cancer cells in a tumor xenograft for tumor
cells. A reference level of the amount of polypeptide produced can be determined by the promoter
expression in the bacterial culture samples, before injecting aliquots of the sample into mice (e.g.,
measuring GFP expression in the overnight cultures prepared to inject mice, also known as the
input library). In some embodiments, preferential activation in solid tumors is identified by utilizing
spleen, PC3 tumor xenograft and reference level (i.e., input) determinations described in Example
2 hereafter. In certain embodiments, a promoter is preferentially activated in a tumor of a living
organism. In some embodiments, there can be two references used on the arrays described in
Examples 1 and 2. One reference can be a library of all plasmids extracted from bacteria grown
overnight in LB+Amp (see below) culture broth, as described above. Another suitable reference
would be to compare the profile of bacteria expressing GFP from a particular
tissue of interest to the profile of all bacteria (e.g., GFP expressers and non-expressers)
isolated from the same tissue of interest.
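
The two-fold criterion in the definition above can be expressed as a short check. This is a
minimal sketch with hypothetical names and made-up expression values; it assumes that expression
levels in tumor and non-tumor (e.g., spleen) samples have already been measured on the same scale,
and it treats a promoter with no detectable non-tumor expression as preferentially activated
whenever tumor expression is detectable.

```python
def preferentially_activated(tumor_level, non_tumor_level, min_fold=2.0):
    """True if tumor expression is at least min_fold times non-tumor expression."""
    if tumor_level <= 0:
        return False  # nothing detectable in the tumor
    if non_tumor_level <= 0:
        return True   # detectable in tumor, not detectably expressed elsewhere
    return tumor_level / non_tumor_level >= min_fold


# Hypothetical signal values for three candidate promoters: (tumor, spleen).
measurements = {"promoter_a": (10.0, 4.0), "promoter_b": (10.0, 6.0), "promoter_c": (5.0, 0.0)}
selected = [name for name, (tumor, spleen) in measurements.items()
            if preferentially_activated(tumor, spleen)]
```

The input-library reference level described in the text would be applied before this step, to put
the tumor and spleen measurements on a comparable scale.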

Also provided are suitable delivery vectors for administering the isolated nucleic acid which may
comprise a recombinant expression system. In some embodiments, recombinant host cells that
contain the nucleic acid molecules described above or below may be used to deliver the
expression system to a patient or subject. In certain embodiments, the cells may be avirulent
Salmonella cells. Also provided are pharmaceutical compositions which can comprise the nucleic
acid reagents isolated, generated or modified by methods described herein, or cells which harbor
such nucleic acid reagents.

Also  provided,  in  certain  embodiments,  are  methods  to  treat  solid  tumors,  which  methods  can 
comprise  administering  to  a  subject  harboring  a  tumor  the  nucleic  acid  molecules  isolated  or 
generated  as  described  herein,  the  cells  containing  them  or  compositions  comprising  the  nucleic 
acid  reagents  and/or  cells  harboring  them. 

Also provided, in some embodiments, are methods for identifying a promoter preferentially
activated in tumor tissue, which method comprises: (a) providing a library of expression systems,
each of which may comprise a nucleotide sequence encoding a detectable protein operably linked to a
different candidate promoter; (b) providing the library to solid tumor tissue and to normal tissue; (c)
identifying cells from each tissue that show high levels of expression of the detectable protein; and
(d) obtaining the expression systems from the cells that produce greater levels of detectable
protein in tumor tissue as compared to normal tissue, and identifying the promoters of the
expression system. In some embodiments, the method may further comprise scoring the
promoters identified in (d) (e.g., as described below in Example 2). In some embodiments, the library
is provided in recombinant host cells. In certain embodiments, the library of DNA fragments can be
a random set of fragments from a bacterial genome (e.g., a Salmonella genome) in the
range of about 25 to about 10,000 base pairs (bp) in length. In some embodiments,
the library may comprise known nucleic acid regions or known promoter regions from a bacterial
genome in the range of about 25 to about 10,000 bp in length.

In certain embodiments, the promoters can be Salmonella promoters and the recombinant host
cells can be Salmonella. In some embodiments, the candidate promoters are from bacteria, or are
80% or more identical to promoters from bacteria. In certain embodiments, the bacteria can be
Enterobacteriaceae, and in some embodiments the Enterobacteriaceae can be Salmonella.

Also  provided,  in  some  embodiments,  is  an  expression  system  which  comprises  a  nucleotide 
sequence  encoding  a  toxic  or  therapeutic  RNA  or  protein  or  an  RNA  or  protein  that  participates  in 
generating  a  desired  toxin  or  therapeutic  agent  operably  linked  to  a  promoter  identified  by  the 
methods  described  herein.  Also  provided  herein,  in  certain  embodiments,  are  recombinant  host 
cells that may comprise an expression system described herein.

Also  provided,  in  certain  embodiments,  are  methods  to  treat  solid  tumors  which  methods  comprise 
administering  an  expression  system  described  herein  or  cells  containing  an  expression  system 
described  herein,  to  a  subject  harboring  a  solid  tumor. 

Also  provided,  in  some  embodiments,  is  an  expression  system  which  may  comprise  a  first 
promoter  nucleotide  sequence  operably  linked  to  a  first  coding  sequence  and  second  promoter 
nucleotide  sequence  operably  linked  to  a  second  coding  sequence,  where:  the  first  coding 
sequence  and  the  second  coding  sequence  encode  polypeptides  that  individually  do  not  inhibit 
tumor growth; polypeptides encoded by the first coding sequence and the second coding
sequence, in combination, inhibit tumor growth; and the first promoter nucleotide sequence and the
second  promoter  nucleotide  sequence  can  be  preferentially  activated  in  solid  tumors  of  living 
organisms.  In  certain  embodiments,  one  or  more  of  the  promoter  nucleotide  sequences  can  be 
preferentially activated in solid tumors (e.g., one promoter is constitutive and one promoter is
preferentially activated in solid tumors). In some embodiments, the first promoter nucleotide
sequence  and  the  second  promoter  nucleotide  sequence  can  be  in  the  same  nucleic  acid 
molecule.  In  certain  embodiments,  the  first  promoter  nucleotide  sequence  and  the  second 
promoter nucleotide sequence may be in different nucleic acid molecules. In some embodiments,
the first promoter nucleotide sequence and the second promoter nucleotide sequence can be
bacterial  nucleotide  sequences.  In  certain  embodiments,  the  bacterial  sequences  may  be 
Enterobacteriaceae  sequences,  and  in  some  embodiments  the  Enterobacteriaceae  sequences  can 
be  Salmonella  sequences.  In  certain  embodiments,  the  different  nucleic  acid  molecules  can  be 
disposed  in  the  same  recombinant  host  cell,  and  in  some  embodiments,  the  different  nucleic  acid 
molecules can be disposed in different recombinant host cells of the same species. In some
embodiments,  the  different  recombinant  host  cells  can  be  different  bacterial  species. 

In  some  embodiments,  expression  systems  as  described  herein  can  produce  two  components  that 
interact  to  provide  a  functional  therapeutic  agent,  where:  a  first  coding  sequence  may  encode  an 
enzyme, a second coding sequence may encode a prodrug, and the enzyme can process the
prodrug into a drug that inhibits tumor growth. In certain embodiments, expression systems as
described herein can produce two components that interact to provide a functional therapeutic
agent, where: the first coding sequence may encode a first polypeptide, the second coding
sequence can encode a second polypeptide, and the first polypeptide and the second polypeptide
can form a complex that inhibits tumor growth.

In  some  embodiments,  the  first  promoter  nucleotide  sequence,  the  second  promoter  nucleotide 
sequence,  or  the  first  promoter  nucleotide  sequence  and  the  second  promoter  nucleotide 
sequence  can  comprise  (i)  a  nucleotide  sequence  of  Table  2A,  (ii)  a  functional  promoter  nucleotide 
sequence 80% or more identical to a nucleotide sequence of Table 2A, or (iii) a functional
promoter subsequence of (i) or (ii). In certain embodiments, the functional promoter subsequence
is  about  20  to  about  150  nucleotides  in  length.  In  some  embodiments,  expression  systems 
described  herein  may  be  contained  in  recombinant  host  cells,  and  in  certain  embodiments,  the 
recombinant  host  cells  can  be  avirulent  Salmonella. 

Also  provided,  in  certain  embodiments,  is  an  expression  system  which  comprises  three  or  more 
promoters  operably  linked  to  three  or  more  coding  sequences,  where  one,  two,  or  more  of  the 
promoter nucleotide sequences are preferentially activated in solid tumors. In some embodiments,
the coding sequences encode polypeptides that individually do not inhibit tumor growth and
polypeptides  encoded  by  the  coding  sequences,  in  combination,  inhibit  tumor  growth. 

Certain  embodiments  are  described  further  in  the  following  description,  examples,  claims  and 
drawings.

Brief  Description  of  the  Drawings 

The  drawings  illustrate  embodiments  of  the  invention  and  are  not  limiting.  For  clarity  and  ease  of 
illustration, the drawings are not made to scale and, in some instances, various aspects may be
shown  exaggerated  or  enlarged  to  facilitate  an  understanding  of  particular  embodiments. 

FIG.  1  is  a  flow  diagram  illustrating  the  procedure  used  to  construct  the  nucleic  acid  libraries  used 
to identify and isolate Salmonella genomic sequences corresponding to promoter elements.
FIG. 2 shows photographs taken of tumors expressing GFP, demonstrating the in vivo function of the
promoter elements identified and isolated using the methods described herein.

Detailed  Description 

Methods and compositions described herein have been designed to identify and isolate nucleic
acid  promoter  sequences  that  can  be  preferentially  activated  under  unique  conditions  found  inside 
solid  tumors  of  living  organisms.  Without  being  limited  by  any  particular  theory  or  to  any  particular 
class  of  inducible  promoters,  promoter  identification  methods  described  herein  may  be  utilized  to 
identify  all  classes  of  promoters  that  are  preferentially  active  in  solid  tumors  of  living  organisms.  In 
some embodiments, promoter identification methods described herein can potentially identify
promoters  activated  by  the  following  classes  of  regulatory  agents,  including  but  not  limited  to, 
gases  (e.g.,  oxygen,  nitrogen,  carbon  dioxide  and  the  like),  pH  (e.g.,  acidic  pH  or  basic  pH),  metals 
(e.g.,  iron,  copper  and  the  like),  hormones  (e.g.,  steroids,  peptides  and  the  like),  and  various 
cellular  components  (e.g.,  purines,  pyrimidines,  sugars,  and  the  like).  The  methods  and 
compositions described herein also can be used to identify promoters preferentially active in any
part  of  the  body  of  a  living  organism,  including  wounds  or  diseased  parts  of  the  body,  for  example. 
Non-limiting  examples  of  solid  tumors  that  may  be  treated  by  methods  and  compositions  described 
herein are sarcomas (e.g., rhabdomyosarcoma, osteosarcoma, and the like),
lymphomas, blastomas (e.g., hepatoblastoma, retinoblastoma, and neuroblastoma),
germ cell tumors (e.g., choriocarcinoma and endodermal sinus tumor), endocrine
tumors, and carcinomas (e.g., adrenocortical carcinoma, colorectal carcinoma, and hepatocellular
carcinoma).

Promoter elements preferentially activated in solid tumors of living organisms, identified and
isolated using the methods described herein, can be used in targeted, tumor specific therapies. In
some  embodiments  a  promoter  nucleotide  sequence  (e.g.,  heterologous  promoter)  is  operably 
linked  to  a  nucleotide  sequence  encoding  one  or  more  therapeutic  agents.  In  some  embodiments, 
the  promoter  sequence  can  be  a  naturally  occurring  nucleic  acid  sequence.  A  therapeutic  agent 
includes, without limitation, a toxin (e.g., ricin, diphtheria toxin, abrin, and the like), a peptide,
polypeptide or protein with therapeutic activity (e.g., methioninase, nitroreductase, antibody,
antibody fragment, single chain antibody), a prodrug (e.g., CB1954), or an RNA molecule (e.g.,
siRNA, ribozyme and the like). The structures of such therapeutic agents are known
and  can  be  adapted  to  systems  described  herein,  and  can  be  from  any  suitable  organism,  such  as 
a prokaryote (e.g., bacteria) or eukaryote (e.g., yeast, fungi, reptile, avian, mammal (e.g., human or
non-human)),  for  example. 

Antibodies sometimes are IgG, IgM, IgA, IgE, or an isotype thereof (e.g., IgG1, IgG2a, IgG2b or
IgG3), sometimes are polyclonal or monoclonal, and sometimes are chimeric, humanized or
bispecific versions of such antibodies. Polyclonal and monoclonal antibodies that bind specific
antigens  are  commercially  available,  and  methods  for  generating  such  antibodies  are  known.  In 
general,  polyclonal  antibodies  are  produced  by  injecting  an  isolated  antigen  into  a  suitable  animal 
(e.g.,  a  goat  or  rabbit);  collecting  blood  and/or  other  tissues  from  the  animal  containing  antibodies 
specific  for  the  antigen  and  purifying  the  antibody.  Methods  for  generating  monoclonal  antibodies, 
in general, include injecting an animal (e.g., often a mouse or a rat) with an isolated antigen;
isolating  splenocytes  from  the  animal;  fusing  the  splenocytes  with  myeloma  cells  to  form 
hybridomas;  isolating  the  hybridomas  and  selecting  hybridomas  that  produce  monoclonal 
antibodies which specifically bind the antigen (e.g., Kohler & Milstein, Nature 256:495-497 (1975)
and StGroth & Scheidegger, J Immunol Methods 5:1-21 (1980)). Examples of monoclonal
antibodies, including anti-MDM2 antibodies, anti-p53 antibodies (pAB421, DO-1, and an antibody that
binds phosphoryl-ser15), anti-dsDNA antibodies and anti-BrdU antibodies, are described hereafter.

Methods  for  generating  chimeric  and  humanized  antibodies  also  are  known  (see,  e.g.,  U.S.  patent 
No. 5,530,101 (Queen, et al.), U.S. patent No. 5,707,622 (Fung, et al.) and U.S. patent Nos.
5,994,524 and 6,245,894 (Matsushima, et al.)), which generally involve transplanting an antibody
variable  region  from  one  species  (e.g.,  mouse)  into  an  antibody  constant  domain  of  another 
species  (e.g.,  human).  Antigen-binding  regions  of  antibodies  (e.g.,  Fab  regions)  include  a  light 
chain  and  a  heavy  chain,  and  the  variable  region  is  composed  of  regions  from  the  light  chain  and 
the heavy chain. Given that the variable region of an antibody is formed from six complementarity-determining
regions (CDRs) in the heavy and light chain variable regions, one or more CDRs from
one antibody can be substituted (i.e., grafted) with a CDR of another antibody to generate chimeric
antibodies.  Also,  humanized  antibodies  are  generated  by  introducing  amino  acid  substitutions  that 
render  the  resulting  antibody  less  immunogenic  when  administered  to  humans. 

An antibody sometimes is an antibody fragment, such as a Fab, Fab’, F(ab’)2, Dab, Fv or single-chain
Fv (scFv) fragment, and methods for generating antibody fragments are known (see, e.g.,
U.S.  Patent  Nos.  6,099,842  and  5,990,296  and  PCT/GB00/04317).  In  some  embodiments,  a 
binding partner in one or more hybrids is a single-chain antibody fragment, which sometimes is
constructed by joining a heavy chain variable region with a light chain variable region by a
polypeptide linker (e.g., the linker is attached at the C-terminus or N-terminus of each chain) by
recombinant  molecular  biology  processes.  Such  fragments  often  exhibit  specificities  and  affinities 
for  an  antigen  similar  to  the  original  monoclonal  antibodies.  Bifunctional  antibodies  sometimes  are 
constructed  by  engineering  two  different  binding  specificities  into  a  single  antibody  chain  and 
sometimes are constructed by joining two Fab’ regions together, where each Fab’ region is from a
different  antibody  (e.g.,  U.S.  Patent  No.  6,342,221).  Antibody  fragments  often  comprise 
engineered  regions  such  as  CDR-grafted  or  humanized  fragments.  In  certain  embodiments  the 
binding  partner  is  an  intact  immunoglobulin,  and  in  other  embodiments  the  binding  partner  is  a  Fab 
monomer  or  a  Fab  dimer. 

In  some  embodiments,  one  or  more  promoter  elements  preferentially  active  in  the  solid  tumors  of 
living  organisms  may  be  operably  linked,  on  the  same  or  different  nucleic  acid  reagents,  to 
nucleotide  sequences  that  can  encode  one  or  more  components  of  a  multi-component  (e.g.,  two  or 
more  components)  therapeutic  agent.  Therapeutic  agents  for  such  applications  include,  without 
limitation, an enzyme coding sequence; a prodrug coding sequence; a protein comprising two
peptide  sequences  that  interact  to  form  the  therapeutic  agent;  related  genes  from  a  metabolic 
pathway;  or  one  or  more  RNA  molecules  that  functionally  interact  to  form  a  therapeutic  agent,  for 
example.  In  certain  embodiments  targeted,  tumor  specific  therapies  may  comprise  an  expression 
system that may comprise a nucleic acid reagent contained in a recombinant host cell. The term
“operably linked” as used herein refers to a nucleic acid sequence (e.g., a coding sequence)
present  on  the  same  nucleic  acid  molecule  as  a  promoter  element  and  whose  expression  is  under 
the  control  of  said  promoter  element. 

Expression Systems

Embodiments  described  herein  provide  an  expression  system  useful  for  delivering  a  therapeutic 
agent or pharmaceutical composition (e.g., toxin, drug, prodrug, or microorganism (e.g.,
recombinant host cell) expressing a toxin, drug, or prodrug) to a specific target or tissue within a
living subject exhibiting a condition treatable by the therapeutic agent or pharmaceutical
composition (e.g., living organism with a solid tumor, for example). Embodiments described herein
also  may  be  useful  for  driving  production  of  a  system  for  generating  toxic  substances  or  to  elicit 
responses  from  the  host,  for  example  by  expressing  cytokines,  interleukins,  growth  inhibitors,  or 
therapeutic RNAs or proteins from the expression system or causing the host organism to increase
expression of cytokines, interleukins, growth inhibitors, or therapeutic RNAs or proteins by
expression of an agent which can elicit the appropriate metabolic or immunological response. In
some  embodiments,  the  expression  system  may  comprise  a  nucleic  acid  reagent  and  a  delivery 
vector.  The  delivery  vector  sometimes  can  be  a  microorganism  (e.g.,  bacteria,  yeast,  fungi,  or 
virus)  that  harbors  the  nucleic  acid  reagent,  and  can  express  the  product  of  the  nucleic  acid 
reagent or can deliver the nucleic acid reagent to the subject for expression within host cells.

In  some  embodiments,  an  expression  system  may  comprise  a  promoter  element  operably  linked  to 
a  therapeutic  gene  of  a  nucleic  acid  reagent.  The  nucleic  acid  reagent  may  be  disposed  in  a 
bacterial  host,  where  the  bacterial  host  comprising  the  nucleic  acid  reagent  is  delivered  to  a 
eukaryotic organism such that expression of the nucleic acid reagent, in the appropriate tissue or
structure  (e.g.,  inside  a  solid  tumor,  for  example)  causes  a  therapeutic  effect.  In  certain 
embodiments,  the  expression  system  promoter  elements  sometimes  can  be  regulated  (e.g., 
induced  or  repressed)  in  a  eukaryotic  environment  (e.g.,  bacteria  inside  a  eukaryotic  organism  or 
specific  organ  or  structure  in  an  organism).  In  some  embodiments,  the  expression  system 
promoter elements, isolated using methods described herein, can be selectively regulated. That is,
the promoter elements sometimes can be influenced to increase transcription by providing the
appropriate selective agent (e.g., administering tetracycline or kanamycin, metals, or starvation for
a particular nutrient, for example, and described further below) to the host organism, such that the
recombinant host cell containing the nucleic acid reagent comprising a selectable promoter
element responds by showing a demonstrable (e.g., at least two fold, for example) increase in
transcription  activity  from  the  promoter  element. 
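
The "at least two fold" criterion above can be checked directly from reporter measurements taken with and without the selective agent. The sketch below is one way to do so under stated assumptions: the function names, the use of replicate means, and the default `min_fold` of 2.0 are illustrative choices, and only the two-fold example itself comes from the text.

```python
from statistics import mean


def fold_induction(induced_readings, baseline_readings):
    """Return mean induced signal divided by mean baseline signal."""
    baseline = mean(baseline_readings)
    if baseline <= 0:
        raise ValueError("baseline signal must be positive")
    return mean(induced_readings) / baseline


def is_demonstrable_increase(induced_readings, baseline_readings, min_fold=2.0):
    """True when induction meets the fold threshold (two fold by default)."""
    return fold_induction(induced_readings, baseline_readings) >= min_fold


if __name__ == "__main__":
    # Hypothetical replicate reporter readings, arbitrary units.
    print(fold_induction([30, 50], [10, 10]))
    print(is_demonstrable_increase([15, 15], [10, 10]))
```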

In  certain  embodiments,  an  expression  system  may  comprise  a  nucleotide  sequence  encoding  a 
toxic or therapeutic RNA or protein or an RNA or protein that participates in generating a toxin or
therapeutic  agent  operably  linked  to  a  promoter  identified  by  the  methods  described  herein.  In 
some  embodiments,  an  expression  system  as  described  herein  may  comprise  a  first  promoter 
nucleotide  sequence  operably  linked  to  a  first  coding  sequence  and  a  second  promoter  nucleotide 
sequence  operably  linked  to  a  second  coding  sequence,  where:  the  first  coding  sequence  and  the 
second coding sequence may encode RNA or polypeptides that individually do not inhibit tumor
growth;  RNA  or  polypeptides  encoded  by  the  first  coding  sequence  and  the  second  coding 
sequence,  in  combination,  inhibit  tumor  growth;  and  the  first  promoter  nucleotide  sequence  and  the 
second  promoter  nucleotide  sequence  can  be  preferentially  activated  in  solid  tumors  of  living 
organisms.  In  some  embodiments  an  expression  system  as  described  herein  may  comprise  two  or 
more sequences encoding toxic or therapeutic RNA or proteins, or RNA or proteins that participate
in  generating  a  toxin  or  therapeutic  agent,  operably  linked  to  a  similar  number  of  promoter 
elements  identified  by  methods  described  herein. 

In  some  embodiments,  a  nucleotide  coding  sequence  can  encode  an  RNA  that  has  a  function 
other than encoding a protein. Non-limiting examples of coding sequences that do not encode
proteins are sequences encoding tRNA, rRNA, siRNA, or anti-sense RNA. rRNAs (i.e., ribosomal RNAs) of various
organisms sometimes have point mutations that confer antibiotic resistance. Expression of rRNAs
that contain antibiotic resistance mutations inside a solid tumor, when the rRNAs are operably
linked  to  a  heterologous  promoter  sequence  isolated  using  methods  described  herein,  may  provide 
a method for ensuring the survival of the recombinant cells only in the tumor environment, due to
the resistance phenotype induced in the solid tumors. Therefore, all recombinant cells carrying the
expression system would be susceptible to the antibiotic administered to the organism, except
those inside the solid tumor.

In some embodiments, there is provided an expression system described above, where the first
coding sequence can encode an enzyme, the second coding sequence can encode a prodrug, and
the  enzyme  can  process  the  prodrug  into  a  drug  that  inhibits  tumor  growth.  A  non-limiting  example 
of  this  type  of  combination  is  an  inactive  peptide  toxin  and  an  enzyme  which  cleaves  the  inactive 
form to release the active form of the toxin. Another example may be an antibody whose protein
sequence has been determined, for which a synthetic gene has been generated, and which requires
processing  (e.g.,  polypeptide  cleavage)  for  assembly  into  an  active  form.  In  such  examples,  the 
first  and  second  coding  sequences  are  preferentially  expressed  inside  the  solid  tumors,  as  the 
methods  described  herein  select  promoter  elements  preferentially  activated  in  solid  tumors.  The 
combination of targeted, tumor specific expression, by delivery of the expression system
comprising the nucleic acid reagent further comprising promoter elements preferentially activated
in  solid  tumors  of  living  organisms,  as  identified  and  isolated  as  described  herein,  and  enzyme 
catalyzed  activation  of  prodrugs,  offers  a  significant  improvement  in  gene-directed  enzyme  prodrug 
therapies.  The  expression  systems  described  herein  can  be  used  to  express  prodrugs  that,  when 
activated, increase the bioavailability of therapeutic agents in solid tumors, or directly inhibit tumor
growth  by  the  action  of  the  activated  prodrug.  In  some  embodiments,  the  second  coding  sequence 
can  be  a  bacterial  operon  encoding  a  number  of  peptides,  polypeptides  or  proteins  which 
functionally  form  the  prodrug.  In  some  embodiments  the  first  and  second  coding  sequences  can 
encode  synthetically  engineered  enzymes  or  proteins  specifically  designed  as  prodrugs  for 
anticancer therapies.

In  some  embodiments,  there  is  provided  an  expression  system,  where  the  first  coding  sequence 
can  encode  a  first  polypeptide,  the  second  coding  sequence  can  encode  a  second  polypeptide, 
and  the  first  polypeptide  and  the  second  polypeptide  form  a  complex  that  inhibits  tumor  growth. 

Non-limiting examples of two component protein or peptide toxins that can be used as therapeutic
agents  include  Diphtheria  toxin,  various  Pertussis  toxins,  Pseudomonas  endotoxin,  various  Anthrax 
toxins,  and  bacterial  toxins  that  act  as  superantigens  (e.g.,  Staphylococcus  aureus  Exfoliatin  B,  for 
example).  A  combination  of  targeted,  tumor  specific  expression,  by  delivery  of  an  expression 
system  comprising  a  nucleic  acid  reagent  further  comprising  promoter  elements  preferentially 
activated in solid tumors as identified and isolated as described herein, and the use of two
component protein or peptide toxins, offers a significant improvement in targeted, in situ delivery of
anticancer  therapies.  Another  example  of  a  complex  can  include  expressing  two  or  more  portions 
of an antibody (e.g., a light chain and a heavy chain), where the two or more portions can
self-assemble into a complex having antibody binding activity (e.g., antibody fragment).

In  some  embodiments,  the  promoter  elements  of  the  expression  systems  described  herein  (e.g., 
the  first  promoter  nucleotide  sequence,  the  second  promoter  nucleotide  sequence,  or  both 
promoter  nucleotide  sequences)  comprise  (i)  a  nucleotide  sequence  of  Table  2A,  (ii)  a  functional 
promoter nucleotide sequence 80% or more identical to a nucleotide sequence of Table 2A, or (iii)
a functional promoter subsequence of (i) or (ii). That is, a functional promoter nucleotide
sequence can be 80% or more, 81% or more, 82% or more, 83% or more, 84% or more,
85% or more, 86% or more, 87% or more, 88% or more, 89% or more, 90% or more, 91% or more,
92% or more, 93% or more, 94% or more, 95% or more, 96% or more, 97% or more, 98% or more,
or 99% or more identical to a nucleotide sequence of Table 2A. The term “identical” as used
herein refers to two or more nucleotide sequences having substantially the same nucleotide
sequence when compared to each other. One test for determining whether two nucleotide
sequences or amino acid sequences are substantially identical is to determine the percent of
identical nucleotides or amino acids shared.

Sequence  identity  can  also  be  determined  by  hybridization  assays  conducted  under  stringent 
conditions. As used herein, the term “stringent conditions” refers to conditions for hybridization and
washing. Stringent conditions are known to those skilled in the art and can be found in Current
Protocols in Molecular Biology, John Wiley & Sons, N.Y., 6.3.1-6.3.6 (1989). Aqueous and
non-aqueous methods are described in that reference and either can be used. An example of stringent
hybridization conditions is hybridization in 6X sodium chloride/sodium citrate (SSC) at about 45°C,
followed by one or more washes in 0.2X SSC, 0.1% SDS at 50°C. Another example of stringent
hybridization conditions is hybridization in 6X sodium chloride/sodium citrate (SSC) at about
45°C, followed by one or more washes in 0.2X SSC, 0.1% SDS at 55°C. A further example of
stringent hybridization conditions is hybridization in 6X sodium chloride/sodium citrate (SSC) at
about 45°C, followed by one or more washes in 0.2X SSC, 0.1% SDS at 60°C. Often, stringent
hybridization conditions are hybridization in 6X sodium chloride/sodium citrate (SSC) at about
45°C, followed by one or more washes in 0.2X SSC, 0.1% SDS at 65°C. More often, stringency
conditions are 0.5M sodium phosphate, 7% SDS at 65°C, followed by one or more washes at 0.2X
SSC, 1% SDS at 65°C.

Calculations  of  sequence  identity  can  be  performed  as  follows.  Sequences  are  aligned  for  optimal 
comparison  purposes  (e.g.,  gaps  can  be  introduced  in  one  or  both  of  a  first  and  a  second  amino 
acid  or  nucleic  acid  sequence  for  optimal  alignment  and  non-homologous  sequences  can  be 
disregarded for comparison purposes). The length of a reference sequence aligned for
comparison purposes is sometimes 30% or more, 40% or more, 50% or more, often 60% or more,
and  more  often  70%  or  more,  80%  or  more,  90%  or  more,  or  100%  of  the  length  of  the  reference 
sequence.  The  nucleotides  or  amino  acids  at  corresponding  nucleotide  or  polypeptide  positions, 
respectively, are then compared between the two sequences. When a position in the first sequence
is occupied by the same nucleotide or amino acid as the corresponding position in the second
sequence,  the  nucleotides  or  amino  acids  are  deemed  to  be  identical  at  that  position.  The  percent 
identity  between  the  two  sequences  is  a  function  of  the  number  of  identical  positions  shared  by  the 
sequences,  taking  into  account  the  number  of  gaps,  and  the  length  of  each  gap,  introduced  for 
optimal alignment of the two sequences. Comparison of sequences and determination of percent
identity  between  two  sequences  can  be  accomplished  using  a  mathematical  algorithm.  Percent 
identity  between  two  amino  acid  or  nucleotide  sequences  can  be  determined  using  the  algorithm  of 
Meyers  &  Miller,  CABIOS  4:  11-17  (1989),  which  has  been  incorporated  into  the  ALIGN  program 
(version  2.0),  using  a  PAM120  weight  residue  table,  a  gap  length  penalty  of  12  and  a  gap  penalty 
of 4. Also, percent identity between two amino acid sequences can be determined using the
Needleman & Wunsch, J. Mol. Biol. 48:444-453 (1970) algorithm which has been incorporated into
the  GAP  program  in  the  GCG  software  package  (available  at  the  http  address  www.gcg.com), 
using either a BLOSUM62 matrix or a PAM250 matrix, and a gap weight of 16, 14, 12, 10, 8, 6, or 4
and a length weight of 1, 2, 3, 4, 5, or 6. Percent identity between two nucleotide sequences can
be determined using the GAP program in the GCG software package (available at http address
www.gcg.com), using a NWSgapdna.CMP matrix and a gap weight of 40, 50, 60, 70, or 80 and a
length weight of 1, 2, 3, 4, 5, or 6. A set of parameters often used is a BLOSUM62 scoring matrix
with  a  gap  open  penalty  of  12,  a  gap  extend  penalty  of  4,  and  a  frameshift  gap  penalty  of  5. 
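
Once two sequences have been aligned, the percent identity described above reduces to counting matching positions. The following minimal sketch assumes the alignment has already been produced by some other tool and that gaps are written as "-". It is not the ALIGN or GAP program, and it handles gaps more simply than those programs do: any column with a gap in one sequence counts as a mismatch. The function names and the default 80% threshold check are illustrative assumptions.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of alignment columns holding the same residue in both sequences.

    Both inputs must be pre-aligned and therefore of equal length.
    Columns that are gaps in both sequences are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [
        (a, b)
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
        if not (a == gap and b == gap)
    ]
    if not columns:
        raise ValueError("alignment has no comparable columns")
    matches = sum(1 for a, b in columns if a == b and a != gap)
    return 100.0 * matches / len(columns)


def meets_identity_threshold(aligned_a, aligned_b, threshold=80.0):
    """True when percent identity is at or above the threshold."""
    return percent_identity(aligned_a, aligned_b) >= threshold


if __name__ == "__main__":
    # Hypothetical aligned fragments, not sequences from Table 2A.
    print(percent_identity("ACGTACGTAC", "ACGTACGTAA"))
    print(meets_identity_threshold("ACGT-ACGT", "ACGTTACGA"))
```

Dividing by the full alignment length is only one convention. Dedicated alignment programs may report identity over a different denominator, so values from this sketch should not be expected to match their output exactly.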

In some embodiments, the first promoter nucleotide sequence and the second promoter nucleotide
sequence can be in the same nucleic acid molecule (e.g., the same nucleic acid reagent, for
example). In certain embodiments, the first promoter nucleotide sequence and the second promoter
nucleotide sequence can be in different nucleic acid molecules (e.g., different nucleic acid reagents,
for  example).  In  some  embodiments,  three  or  more  promoters  can  be  in  the  same  nucleic  acid 
molecule, and in certain embodiments, three or more promoters can be on different nucleic acid
molecules.  In  some  embodiments,  an  expression  system  may  comprise  functional  promoter 
subsequences  that  are  about  20  to  about  150  nucleotides  in  length. 

In  some  embodiments,  the  first  promoter  nucleotide  sequence  (e.g.,  promoter  element)  and  the 
second promoter nucleotide sequence can be bacterial nucleotide sequences. In some
embodiments,  three  or  more  promoter  nucleotide  sequences  can  be  bacterial  nucleotide 
sequences.  In  certain  embodiments,  the  bacterial  sequences  are  Enterobacteriaceae  sequences, 
and  in  some  embodiments,  the  Enterobacteriaceae  sequences  are  Salmonella  sequences.  In 
some embodiments, the expression systems described herein are contained within recombinant
host cells. In certain embodiments, the cells can be Enterobacteriaceae. In some embodiments,
the  Enterobacteriaceae  can  be  Salmonella,  and  in  certain  embodiments,  the  Salmonella  can  be 
avirulent  Salmonella. 

Nucleic  Acids 

A  nucleic  acid  can  comprise  certain  elements,  which  often  are  selected  according  to  the  intended 
use  of  the  nucleic  acid.  Any  of  the  following  elements  can  be  included  in  or  excluded  from  a 
nucleic  acid  reagent.  A  nucleic  acid  reagent,  for  example,  may  include  one  or  more  or  all  of  the 
following  nucleotide  elements:  one  or  more  promoter  elements,  one  or  more  5’  untranslated 
regions (5’UTRs), one or more regions into which a target nucleotide sequence may be inserted
(an  “insertion  element”),  one  or  more  target  nucleotide  sequences,  one  or  more  3’  untranslated 
regions  (3’UTRs),  and  a  selection  element.  A  nucleic  acid  reagent  can  be  provided  with  one  or 
more  of  such  elements  and  other  elements  (e.g.,  antibiotic  resistance  genes,  multiple  cloning  sites, 
and  the  like)  can  be  inserted  into  the  nucleic  acid  reagent  before  the  nucleic  acid  is  introduced  into 
a suitable expression host or system (e.g., in vivo expression in a host, or in vitro expression in a
cell-free expression system). The elements can be arranged in any order suitable for
expression  in  the  chosen  expression  system. 

In  some  embodiments,  a  nucleic  acid  reagent  may  comprise  a  promoter  element  where  the 
promoter element comprises two distinct transcription initiation start sites (e.g., two promoters
within a promoter element, for example). In some embodiments, a promoter element in a nucleic
acid  reagent  may  comprise  two  promoters.  In  certain  embodiments,  the  promoter  element  may 
comprise  a  constitutive  promoter  and  an  inducible  promoter,  and  in  some  embodiments  a  promoter 
element  may  comprise  two  inducible  promoters.  In  certain  embodiments  a  nucleic  acid  reagent 
may comprise two or more distinct or different promoter elements. In some embodiments, the
promoters  may  respond  to  the  same  or  different  inducers  or  repressors  of  transcription  (e.g., 
induce  or  repress  expression  of  a  nucleic  acid  reagent  from  the  promoter  element).  A  nucleic  acid 
reagent  sometimes  can  contain  more  than  one  promoter  element  that  is  turned  on  at  specific  times 
or  under  specific  conditions. 

A  nucleic  acid  reagent  sometimes  can  comprise  a  5’  UTR  that  may  further  comprise  one  or  more 
elements  endogenous  to  the  nucleotide  sequence  from  which  it  originates,  and  sometimes 
includes  one  or  more  exogenous  elements.  A  5’  UTR  can  originate  from  any  suitable  nucleic  acid, 
such as genomic DNA, plasmid DNA, RNA or mRNA, for example, from any suitable organism
(e.g., virus, bacterium, yeast, fungi, plant, insect or mammal). The artisan may select appropriate
elements  for  the  5’  UTR  based  upon  the  expression  system  being  utilized.  A  5’  UTR  sometimes 
comprises  one  or  more  of  the  following  elements  known  to  the  artisan:  enhancer  sequences, 
silencer sequences, transcription factor binding sites, accessory protein binding sites, feedback
regulation agent binding sites, Pribnow box, TATA box, -35 element, E-box (helix-loop-helix
binding element), transcription initiation sites, translation initiation sites, ribosome binding sites and
the like. In some embodiments, a promoter element may be isolated such that all 5’ UTR elements
necessary for proper conditional regulation are contained in the promoter element fragment, or
within a functional subsequence of a promoter element fragment.

A  nucleic  acid  reagent  sometimes  can  have  a  3’  UTR  that  may  comprise  one  or  more  elements 
endogenous  to  the  nucleotide  sequence  from  which  it  originates,  and  sometimes  includes  one  or 
more  exogenous  elements.  A  3’  UTR  can  originate  from  any  suitable  nucleic  acid,  such  as 
genomic  DNA,  plasmid  DNA,  RNA  or  mRNA,  for  example,  from  any  suitable  organism  (e.g.,  virus, 
bacterium, yeast, fungi, plant, insect or mammal). The artisan may select appropriate elements for
the  3’  UTR  based  upon  the  expression  system  being  utilized.  A  3’  UTR  sometimes  comprises  one 
or  more  of  the  following  elements,  known  to  the  artisan,  which  may  influence  expression  from 
promoter  elements  within  a  nucleic  acid  reagent:  transcription  regulation  site,  transcription  initiation 
site, transcription termination site, transcription factor binding site, translation regulation site,
translation termination site, translation initiation site, translation factor binding site, ribosome
binding site, replicon, enhancer element, silencer element and polyadenosine tail. A 3’ UTR
sometimes includes a polyadenosine tail and sometimes does not, and if a polyadenosine tail is
present, one or more adenosine moieties may be added to or deleted from it (e.g., about 5, about 10,
about 15, about 20, about 25, about 30, about 35, about 40, about 45 or about 50 adenosine
moieties may be added or subtracted).

A  nucleic  acid  reagent  that  is  part  of  an  expression  system  sometimes  comprises  a  nucleotide 
sequence  adjacent  to  the  nucleic  acid  sequence  encoding  a  therapeutic  agent  or  pharmaceutical 
composition  that  is  translated  in  conjunction  with  the  ORF  and  encodes  an  amino  acid  tag.  The 
tag-encoding nucleotide sequence is located 3’ and/or 5’ of an ORF in the nucleic acid reagent,
thereby  encoding  a  tag  at  the  C-terminus  or  N-terminus  of  the  protein  or  peptide  encoded  by  the 
ORF.  Any  tag  that  does  not  abrogate  transcription  and/or  translation  may  be  utilized  and  may  be 
appropriately selected by the artisan.

A tag sometimes comprises a sequence that localizes a translated protein or peptide to a
component  in  a  system,  which  is  referred  to  as  a  “signal  sequence”  or  “localization  signal 
sequence”  herein.  A  signal  sequence  often  is  incorporated  at  the  N-terminus  of  a  target  protein  or 
target  peptide,  and  sometimes  is  incorporated  at  the  C-terminus.  Examples  of  signal  sequences 
are known to the artisan, are readily incorporated into a nucleic acid reagent, and often are
selected according to the expression system chosen by the artisan. A tag sometimes is directly adjacent
to an amino acid sequence encoded by a nucleic acid reagent (i.e., there is no intervening
sequence) and sometimes a tag is substantially adjacent to the amino acid sequence encoded by
the nucleic acid reagent (e.g., an intervening sequence is present). An intervening sequence
sometimes includes a recognition site for a protease, which is useful for cleaving a tag from a
target protein or peptide. A signal sequence or tag, in some embodiments, localizes a translated
protein or peptide to a cell membrane.

Examples of signal sequences include, but are not limited to, a nucleus targeting signal (e.g.,
steroid receptor sequence and N-terminal sequence of SV40 virus large T antigen); mitochondria
targeting signal (e.g., amino acid sequence that forms an amphipathic helix); peroxisome targeting
signal (e.g., C-terminal sequence in YFG from S. cerevisiae); and a secretion signal (e.g.,
N-terminal sequences from invertase, mating factor alpha, PHO5 and SUC2 in S. cerevisiae; multiple
N-terminal sequences of B. subtilis proteins (e.g., Tjalsma et al., Microbiol. Molec. Biol. Rev. 64:
515-547 (2000)); alpha amylase signal sequence (e.g., U.S. Patent No. 6,288,302); pectate lyase
signal sequence (e.g., U.S. Patent No. 5,846,818); precollagen signal sequence (e.g., U.S. Patent
No. 5,712,114); OmpA signal sequence (e.g., U.S. Patent No. 5,470,719); lam beta signal
sequence (e.g., U.S. Patent No. 5,389,529); B. brevis signal sequence (e.g., U.S. Patent No.
5,232,841); and P. pastoris signal sequence (e.g., U.S. Patent No. 5,268,273)).

A  nucleic  acid  reagent  sometimes  contains  one  or  more  origin  of  replication  (ORI)  elements.  In 
some  embodiments,  a  template  comprises  two  or  more  ORIs,  where  one  functions  efficiently  in  one 
organism  (e.g.,  a  bacterium)  and  another  functions  efficiently  in  another  organism  (e.g.,  a 
eukaryote).  A  nucleic  acid  reagent  often  includes  one  or  more  selection  elements.  Selection 
elements often are utilized using known processes to determine whether a nucleic acid reagent is
included  in  a  cell.  In  some  embodiments,  a  nucleic  acid  reagent  includes  two  or  more  selection 
elements,  where  one  functions  efficiently  in  one  organism  and  another  functions  efficiently  in 
another organism.

Examples of selection elements include, but are not limited to, (1) nucleic acid segments that
encode products that provide resistance against otherwise toxic compounds (e.g., antibiotics); (2)
nucleic acid segments that encode products that are otherwise lacking in the recipient cell (e.g.,
essential products, tRNA genes, auxotrophic markers); (3) nucleic acid segments that encode
products that suppress the activity of a gene product; (4) nucleic acid segments that encode
products that can be readily identified (e.g., phenotypic markers such as antibiotic resistance
enzymes (e.g., β-lactamase), β-galactosidase, green fluorescent protein (GFP), yellow fluorescent
protein (YFP), red fluorescent protein (RFP), cyan fluorescent protein (CFP), and cell surface
proteins); (5) nucleic acid segments that bind products that are otherwise detrimental to cell
survival and/or function; (6) nucleic acid segments that otherwise inhibit the activity of any of the
nucleic acid segments described in Nos. 1-5 above (e.g., antisense oligonucleotides); (7) nucleic
acid segments that bind products that modify a substrate (e.g., restriction endonucleases); (8)
nucleic acid segments that can be used to isolate or identify a desired molecule (e.g., specific
protein binding sites); (9) nucleic acid segments that encode a specific nucleotide sequence that
can be otherwise non-functional (e.g., for PCR amplification of subpopulations of molecules); (10)
nucleic acid segments that, when absent, directly or indirectly confer resistance or sensitivity to
particular compounds; (11) nucleic acid segments that encode products that either are toxic (e.g.,
Diphtheria toxin) or convert a relatively non-toxic compound to a toxic compound (e.g., Herpes
simplex thymidine kinase, cytosine deaminase) in recipient cells; (12) nucleic acid segments that
inhibit replication, partition or heritability of nucleic acid molecules that contain them; and/or (13)
nucleic acid segments that encode conditional replication functions, e.g., replication in certain
hosts or host cell strains or under certain environmental conditions (e.g., temperature, nutritional
conditions, and the like).

Nucleic acid reagents can comprise naturally occurring sequences, synthetic sequences, or
combinations  thereof.  Certain  nucleotide  sequences  sometimes  are  added  to,  modified  or 
removed  from  one  or  more  of  the  nucleic  acid  reagent  elements,  such  as  the  promoter,  5’UTR, 
target  sequence,  or  3’UTR  elements,  to  enhance  or  potentially  enhance  transcription  and/or 
translation  before  or  after  such  elements  are  incorporated  in  a  nucleic  acid  reagent.  Certain 
embodiments are directed to a process comprising: determining whether any nucleotide
sequences that increase or potentially increase transcription efficiency are not present in the
elements,  and  incorporating  such  sequences  into  the  nucleic  acid  reagent.  A  nucleic  acid  reagent 
can be of any form useful for the chosen expression system.

In some embodiments, a nucleic acid reagent can be an isolated nucleic acid molecule
which may comprise a recombinant expression system, which expression system can comprise a
nucleotide sequence encoding a toxic or therapeutic RNA or protein, or an RNA or protein that
participates in generating a toxin or therapeutic agent, operably linked to a heterologous promoter,
which promoter is preferentially activated in solid tumors in living organisms. In some
embodiments, the promoter sequence can be a naturally occurring nucleotide sequence. In
certain embodiments, a nucleic acid reagent can be two or more isolated nucleic acid
molecules which may comprise a recombinant expression system, which expression system can
comprise two or more nucleotide sequences encoding toxic or therapeutic RNAs or proteins, or
RNAs or proteins that participate in generating a toxin or therapeutic agent, operably linked to two
or more heterologous promoters, which promoters are preferentially activated in solid tumors in living
organisms. In some embodiments, the isolated nucleic acid of the recombinant expression system
is a promoter nucleic acid. In certain embodiments, the promoter is an Enterobacteriaceae
promoter, and in some embodiments, the promoter is a Salmonella promoter.

Promoters 

A promoter element typically comprises a region of DNA that can facilitate the transcription of a
particular gene, by providing a start site for the synthesis of RNA corresponding to a gene.
Promoters often are located near the genes they regulate, are located upstream of the gene (e.g.,
5’ of the gene), and are on the same strand of DNA as the sense strand of the gene, in some
embodiments. A promoter often interacts with an RNA polymerase, an enzyme that catalyzes
synthesis of nucleic acids using a preexisting nucleic acid as a template. When the template is a
DNA template, an RNA molecule is transcribed before protein is synthesized. Promoter elements
can be found in prokaryotic and eukaryotic organisms.

A  promoter  element  generally  is  a  component  in  an  expression  system  comprising  a  nucleic  acid 
reagent.  An  expression  system  often  can  comprise  a  nucleic  acid  reagent  and  a  suitable  host  for 
expression  of  the  nucleic  acid  reagent.  For  example,  an  expression  system  may  comprise  a 
heterologous promoter operably linked to a toxin gene, carried on a nucleic acid reagent that is
expressed  in  a  bacterial  host,  in  some  embodiments.  Promoter  elements  isolated  using  methods 
described  herein  may  be  recognized  by  any  polymerase  enzyme,  and  also  may  be  used  to  control 
the  production  of  RNA  of  the  therapeutic  agent  or  pharmaceutical  composition  operably  linked  to 
the promoter element in the nucleic acid reagent. In some embodiments, additional 5’ and/or 3’
UTRs may be included in the nucleic acid reagent to enhance the efficiency of the isolated
promoter element.

Methods described herein can be used to identify a promoter preferentially activated in tumor
tissue. In some embodiments the method comprises: (a) providing a library of expression systems
each comprising a nucleotide sequence encoding a detectable protein operably linked to a different
candidate promoter; (b) providing the library to solid tumor tissue and to normal tissue; (c)
identifying cells from each tissue that show high levels of expression of the detectable protein; and
(d) obtaining the expression systems from the cells that produce greater levels of detectable
protein in tumor tissue as compared to normal tissue, and identifying the promoters of the
expression systems. In some embodiments, the method further comprises scoring the promoters
identified in (d) (e.g., by detecting a detectable protein, GFP for example). In certain embodiments,
the library is provided in recombinant host cells. In some embodiments, the DNA fragments of the
library range in size from about 25 base pairs to about 10,000 base pairs in length. In some
embodiments, the fragments can be randomly sized fragments. In certain embodiments, the
fragments can be an ordered set of specific sequences in a particular size range.
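
As an illustration only, the comparison in steps (c) and (d) can be sketched in Python. The sketch assumes each library clone has already been reduced to one detectable-protein signal in tumor tissue and one in normal tissue. The names CloneSignal and tumor_preferential, the pseudocount, and the default ratio threshold are hypothetical and are not taken from the procedures described herein.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CloneSignal:
    """Detectable-protein signal for one library clone (arbitrary units)."""
    clone_id: str
    tumor_signal: float
    normal_signal: float


def tumor_preferential(clones, min_ratio=2.0, pseudocount=1.0):
    """Return (clone_id, ratio) pairs whose tumor/normal ratio is >= min_ratio.

    The pseudocount is added to both signals so that clones with no
    normal-tissue signal do not cause a division by zero. Both defaults
    are illustrative assumptions.
    """
    selected = []
    for clone in clones:
        ratio = (clone.tumor_signal + pseudocount) / (clone.normal_signal + pseudocount)
        if ratio >= min_ratio:
            selected.append((clone.clone_id, ratio))
    # Strongest tumor preference first.
    return sorted(selected, key=lambda pair: pair[1], reverse=True)


# Hypothetical signals for three clones.
example_clones = [
    CloneSignal("clone_a", tumor_signal=99.0, normal_signal=9.0),
    CloneSignal("clone_b", tumor_signal=10.0, normal_signal=10.0),
    CloneSignal("clone_c", tumor_signal=39.0, normal_signal=9.0),
]
hits = tumor_preferential(example_clones)
```

In this toy data set, clone_b is expressed equally in both tissues and is therefore not retained. The threshold would in practice be chosen by the artisan for the detection method used.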

In  some  embodiments,  the  promoters  are  Salmonella  promoters  and  the  recombinant  host  cells 
are  Salmonella.  In  certain  embodiments,  the  candidate  promoters  are  from  bacteria,  or  are  80%  or 
more identical to promoters from bacteria. That is, the candidate promoters can be 80% or
more, 81% or more, 82% or more, 83% or more, 84% or more, 85% or more, 86% or more, 87% or
more,  88%  or  more,  89%  or  more,  90%  or  more,  91%  or  more,  92%  or  more,  93%  or  more,  94%  or 
more,  95%  or  more,  96%  or  more,  97%  or  more,  98%  or  more,  or  99%  or  more  identical  to 
promoters  from  bacteria.  In  some  embodiments,  the  bacteria  are  Enterobacteriaceae  (e.g., 
Salmonella). 
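
For illustration, a percent identity value of the kind recited above can be computed from two sequences that have already been aligned. The Python sketch below assumes the alignment was produced separately and is supplied as two equal-length strings with "-" marking gaps. Counting every alignment column, including gapped columns, in the denominator is one of several conventions in use and is an assumption here, as is the function name percent_identity.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over the columns of a pairwise alignment.

    Columns in which both sequences carry a gap are ignored; a column
    with a gap in only one sequence counts as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for base_a, base_b in zip(aligned_a.upper(), aligned_b.upper()):
        if base_a == "-" and base_b == "-":
            continue
        compared += 1
        if base_a == base_b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable alignment columns")
    return 100.0 * matches / compared


def meets_identity_threshold(aligned_a, aligned_b, threshold=80.0):
    """True if the aligned pair is at least `threshold` percent identical."""
    return percent_identity(aligned_a, aligned_b) >= threshold
```

For example, two aligned 10-nucleotide sequences that differ at a single position are 90% identical under this convention and would satisfy an "80% or more" criterion.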

Detailed  experimental  procedures  for  construction  of  promoter  trap  constructs  and  libraries  are 
presented  below  in  Example  1  and  in  FIG.  1.  FIG.  1  is  a  flow  diagram  outlining  how  the  libraries 
were  enriched  for  promoter  sequences  preferentially  activated  in  solid  tumors.  The  initial  library 
was constructed by ligating sonicated, end-repaired Salmonella genomic DNA, size-selected for
fragments 300 to 500 base pairs in length, into a promoter trap construct upstream of a
promoterless green fluorescent protein (GFP) sequence. Although GFP was the detectable protein
used herein, due to ease of detection, any detectable protein that can be easily and efficiently
detected can be used in place of GFP. Non-limiting examples of detectable proteins are other
fluorescent proteins, peptides or proteins that inactivate antibiotics (e.g., beta-lactamase, the
enzyme responsible for penicillin resistance) and the like.
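
The size-selection step can be mimicked in silico when fragment sequences are available as strings. The Python sketch below is a minimal illustration: the function name size_select and the toy fragments are hypothetical, and only the 300 to 500 base pair window comes from the text above.

```python
def size_select(fragments, min_bp=300, max_bp=500):
    """Keep fragments whose length lies within [min_bp, max_bp], inclusive."""
    if min_bp > max_bp:
        raise ValueError("min_bp must not exceed max_bp")
    return [frag for frag in fragments if min_bp <= len(frag) <= max_bp]


# Toy fragments of known length; the sequence content is arbitrary.
toy_fragments = ["A" * 120, "C" * 300, "G" * 425, "T" * 500, "A" * 501]
selected_lengths = [len(frag) for frag in size_select(toy_fragments)]
```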

The  library  contained  in  recombinant  cells  can  be  injected  into  rodents  (e.g.,  mice,  rats)  bearing 
solid  tumor  xenografts,  as  described  below.  Enrichment  for  promoters  preferentially  active  in 
tumors was performed as described in Example 2. The experimental results from the enrichment
process  are  presented  in  Tables  2-7.  Tables  2-7  contain  sequences  of  promoters  active  in  normal 
tissue  (e.g.,  spleen),  promoters  active  in  both  normal  tissue  and  solid  tumors  and  promoters 
preferentially  activated  in  solid  tumors  (see  Tables  2A,  2B,  6A  and  6B). 

The sequences isolated using the methods described herein were mapped to genome positions as
described in Example 2, using high density, high resolution arrays constructed as described in
Example 1. For each library construct, the position with the highest enrichment signal
is given in the Tables as the nucleotide position. The nucleotide
position may correspond to the start site of the isolated promoter element. Definitive promoter start
site mapping can be performed using a suitable method. One method is 5’ RACE (rapid
amplification of cDNA ends), which can be routinely performed. 5’ RACE can be used
to identify the first nucleotide in an mRNA or other RNA molecule and can also be used to identify
and/or clone a gene when only a small portion of the sequence is known. An example of a 5’
RACE procedure suitable for identifying a transcription start site from promoter elements isolated
using the methods described herein is Schramm et al., “A simple and reliable 5’ RACE approach”,
Nucleic Acids Research, 28(22):e96, 2000.
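
The assignment of a nucleotide position from array data, as described above, amounts to taking the probe position with the highest enrichment signal. A minimal Python sketch follows. It assumes the signals for one construct are held in a mapping from genome coordinate to enrichment value. The function name peak_position, the tie-breaking rule (lowest coordinate wins) and the toy values are assumptions for illustration.

```python
def peak_position(probe_signals):
    """Return the genome coordinate with the highest enrichment signal.

    probe_signals maps genome coordinate (int) to enrichment signal (float).
    Ties are resolved in favor of the lowest coordinate.
    """
    if not probe_signals:
        raise ValueError("no probe signals supplied")
    return min(probe_signals, key=lambda pos: (-probe_signals[pos], pos))


# Hypothetical enrichment signals for probes tiling one library construct.
toy_signals = {1000: 1.5, 1050: 4.2, 1100: 4.2, 1150: 0.9}
reported_position = peak_position(toy_signals)
```

As the text notes, a position obtained this way only approximates the promoter start site; a method such as 5’ RACE would be needed for definitive mapping.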

Where  identifiable,  gene  names  and  functions  are  presented  along  with  the  sequence  information 
for the isolated nucleic acid sequences that exhibited promoter activity (e.g., showed at least a
two-fold increase in detectable GFP over input). Table 6 describes the distribution of sequences
isolated using the methods described herein. The majority of sequences that exhibited promoter
activity  (e.g.,  transcription  of  GFP)  were  isolated  from  intergenic  sequences.  This  observation  is  in 
keeping  with  the  finding  that  many  bacterial  promoters  lie  outside  of  gene  coding  sequences. 
Further  distribution  results  are  discussed  in  Example  2. 
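
To illustrate how mapped positions might be sorted into intergenic and coding locations, the Python sketch below classifies a coordinate against a list of annotated gene intervals. It assumes non-overlapping genes given as 1-based, inclusive (start, end, name) tuples. The gene names and coordinates are invented for the example and do not correspond to the Tables.

```python
from bisect import bisect_right


def classify_position(position, genes):
    """Classify a coordinate as ("genic", gene_name) or ("intergenic", None).

    genes is an iterable of (start, end, name) tuples with 1-based,
    inclusive coordinates; the intervals are assumed not to overlap.
    """
    ordered = sorted(genes)
    starts = [start for start, _end, _name in ordered]
    # Index of the last gene that starts at or before the position.
    index = bisect_right(starts, position) - 1
    if index >= 0:
        start, end, name = ordered[index]
        if start <= position <= end:
            return ("genic", name)
    return ("intergenic", None)


def intergenic_fraction(positions, genes):
    """Fraction of positions that fall outside every annotated gene."""
    if not positions:
        raise ValueError("no positions supplied")
    hits = sum(1 for pos in positions if classify_position(pos, genes)[0] == "intergenic")
    return hits / len(positions)


# Hypothetical annotation with two genes.
toy_genes = [(100, 200, "geneA"), (300, 400, "geneB")]
```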

To  confirm  the  tumor  specificity  of  the  isolated  sequences,  a  number  of  clones  were  further 
investigated  (see  Example  2,  Confirmation  of  tumor  specificity  in  vivo).  In  particular,  Clone  ID  Nos. 
10,  28,  45,  44,  and  84  were  further  investigated  in  vivo  as  described  in  Example  2.  Three  clones  in 
particular were induced to a greater degree in tumor as compared to spleen (e.g., Clones 10, 28
and 45). FIG. 2 illustrates the expression of GFP from these clones in vivo in whole mice and in
tumor alone. FIG. 2 presents the microscopic imaging (Olympus OV100 small animal imaging
system) of fluorescent bacteria in mouse spleen and tumors. Clone C28 maps to the upstream
intergenic region of the flhB gene, clone C10 maps to the pefL intergenic region, and C45 maps to
the intergenic region of the gene ansB. The number of colony forming units for each trial is given
below the image, to account for differences in signal intensities. The number of colony forming
units isolated in each trial was approximately equal, and therefore did not contribute to the
differences in intensity seen in the images.

Certain promoter elements can be regulated in a conditional manner. That is, promoters
sometimes can be turned on, turned off, up-regulated or down-regulated by the influence of certain
environmental, nutritional, or internal signals (e.g., heat inducible promoters, light regulated
promoters, feedback regulated promoters, hormone influenced promoters, tissue specific
promoters, oxygen and pH influenced promoters and the like). Promoters influenced
by environmental, nutritional or internal signals frequently are influenced by a signal (direct or
indirect) that binds at or near the promoter and increases or decreases expression of the target
sequence under certain conditions and/or in specific tissues. Certain promoter elements can be
regulated in a selective manner, as noted above. In some embodiments, the promoter does not
include a nucleotide sequence to which a bacterial (e.g., gram negative (e.g., E. coli, Salmonella))
oxygen-responsive global transcription factor (FNR) binds substantially. In certain embodiments,
the promoter sequence does not include one or more of the following subsequences:
GGATAAAAGTGACCTGACGCAATATTTGTCTTTTCTTGCTTAATAATGTTGTCA,
GGATAAAAGTGACCTGACGCAATATTTGTCTTTTCTTGCTTTATAATGTTGTCA,
GGATAAAATTGATCTGAATCAATATTTGTCTTTTCTTGCTTAATAATGTTGTCA, or
GGATAAAAGGATCCGACGCAATATTGTCTTTTCTTGCTTAATAATGTTGTCA.

In some embodiments, the promoter sequence is not identical to a bacterial promoter that
regulates the bacterial pepT gene.
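
A candidate promoter can be screened for the excluded subsequences with a simple substring search. The Python sketch below is a minimal illustration under stated assumptions: whitespace in a sequence listing is ignored, matching is case-insensitive, and the reverse-complement strand is also searched by default, which is a design choice of the sketch rather than something stated above. The function names are hypothetical; the example subsequence is the first one listed above.

```python
import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def normalize(sequence):
    """Remove whitespace and upper-case a nucleotide sequence."""
    return re.sub(r"\s+", "", sequence).upper()


def reverse_complement(sequence):
    """Reverse complement of a DNA sequence containing only A, C, G and T."""
    return normalize(sequence).translate(_COMPLEMENT)[::-1]


def contains_excluded(promoter, excluded, both_strands=True):
    """True if any excluded subsequence occurs in the promoter.

    With both_strands=True the reverse complement of each excluded
    subsequence is searched as well (an assumption of this sketch).
    """
    target = normalize(promoter)
    for subsequence in excluded:
        forward = normalize(subsequence)
        if forward in target:
            return True
        if both_strands and reverse_complement(forward) in target:
            return True
    return False


# First excluded subsequence from the list above, whitespace removed.
EXCLUDED = ["GGATAAAAGTGACCTGACGCAATATTTGTCTTTTCTTGCTTAATAATGTTGTCA"]
```

A candidate for which contains_excluded returns False would satisfy the exclusion with respect to the sequences supplied. The search tests exact matches only and does not assess whether FNR binds.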

Non-limiting examples of selective agents that can be used to selectively regulate promoters in
therapeutic methods using expression systems and promoter elements described herein include
(1) nucleic acid segments that encode products that provide resistance against otherwise toxic
compounds (e.g., antibiotics); (2) nucleic acid segments that encode products that are otherwise
lacking in the recipient cell (e.g., essential products, tRNA genes, auxotrophic markers); (3) nucleic
acid segments that encode products that suppress the activity of a gene product; (4) nucleic acid
segments that encode products that can be readily identified (e.g., phenotypic markers such as
antibiotic resistance enzymes (e.g., β-lactamase), β-galactosidase, green fluorescent protein (GFP),
yellow fluorescent protein (YFP), red fluorescent protein (RFP), cyan fluorescent protein (CFP), and
cell surface proteins); (5) nucleic acid segments that bind products that are otherwise detrimental to cell
survival and/or function; (6) nucleic acid segments that otherwise inhibit the activity of any of the
nucleic acid segments described in Nos. 1-5 above (e.g., antisense oligonucleotides); (7) nucleic
acid segments that bind products that modify a substrate (e.g., restriction endonucleases); (8)
nucleic acid segments that can be used to isolate or identify a desired molecule (e.g., specific
protein binding sites); (9) nucleic acid segments that encode a specific nucleotide sequence that
can be otherwise non-functional (e.g., for PCR amplification of subpopulations of molecules); (10)
nucleic acid segments that, when absent, directly or indirectly confer resistance or sensitivity to
particular compounds; (11) nucleic acid segments that encode products that either are toxic (e.g.,
Diphtheria toxin) or convert a relatively non-toxic compound to a toxic compound (e.g., Herpes
simplex thymidine kinase, cytosine deaminase) in recipient cells; (12) nucleic acid segments that
inhibit replication, partition or heritability of nucleic acid molecules that contain them; and/or (13)
nucleic acid segments that encode conditional replication functions, e.g., replication in certain
hosts or host cell strains or under certain environmental conditions (e.g., temperature, nutritional
conditions, and the like). In some embodiments, the nucleic acids identified and isolated using
methods described herein (e.g., promoter elements preferentially activated in solid tumors of living
organisms) can be selectively regulated by administration of a suitable selective agent, as
described above or known and available to the artisan.

Methods presented herein take into account the unique environment inside a tumor. Therefore,
while hypoxia-induced promoters may be identified, other promoters preferentially activated in the
unique tumor environment can also be identified and isolated. Some specific classes of promoters
preferentially activated inside tumors were presented above. Therefore, the promoters isolated
using methods described herein may be preferentially activated by a wide variety of regulatory
molecules and conditions.

Therapeutic Agents and Methods of Treatment

Expression  systems,  nucleic  acid  reagents  and  pharmaceutical  compositions  described  herein  that 
comprise  promoter  elements  preferentially  activated  in  solid  tumors,  or  cells  containing  the 
expression system, nucleic acid reagents and pharmaceutical compositions described herein, can
be used to treat solid tumors in a living organism. In some embodiments, methods for treating
solid tumors comprise administering to a subject harboring the tumors the nucleic acid molecules
or nucleic acid reagents comprising nucleic acid sequences preferentially activated in tumors (e.g.,
nucleic acids bearing promoter elements isolated using the methods described herein, for
example), cells containing the above described nucleic acids, or compositions comprising the
isolated nucleic acids. In some embodiments, the expression system, nucleic acid reagent, and/or
pharmaceutical  compositions  comprise  a  nucleotide  sequence  encoding  a  toxic  or  therapeutic  RNA 
or  protein,  or  an  RNA  or  protein  that  participates  in  generating  a  desired  toxin  or  therapeutic  agent 
operably  linked  to  a  promoter  identified  by  the  methods  described  herein. 

In some embodiments, the therapeutic RNA or protein can be an enzyme which catalyzes the
activation of a prodrug. That is, the nucleotide sequence encoding the enzyme can be operably
linked to a promoter element preferentially activated in solid tumors. The nucleic acid reagent /
expression system / pharmaceutical composition contained in a recombinant cell can be
administered along with the prodrug (e.g., administered by intramuscular or intravenous
injection). The avirulent recombinant host cell sometimes can preferentially colonize the solid
tumor, and the prodrug will remain inactive in all tissues except inside the solid tumor, because the
enzyme is produced only by recombinant cells that have colonized the tumor, under control of the
heterologous promoter that is preferentially activated in the solid tumors of living organisms.
Non-limiting examples of this type of combination are the enzymes nitroreductase or quinone
reductase 2 and the prodrug CB1954 (5-[aziridin-1-yl]-2,4-dinitrobenzamide), or cytochrome P450
enzymes 2B1, 2B4, and 2B5 and the anticancer prodrugs cyclophosphamide and ifosfamide.
Further non-limiting examples of enzyme prodrug combinations can be found in Rooseboom et al.,
“Enzyme-Catalyzed Activation of Anticancer Prodrugs”, Pharmacol. Rev. 56:53-102, 2004, hereby
incorporated by reference in its entirety.

In certain embodiments, bacterial two component toxins can also be utilized as the toxic or
therapeutic proteins or peptide sequences operably linked to the promoters isolated using methods
described herein. Non-limiting examples of bacterial toxins suitable for use in compositions
described herein were presented above. Several of these toxins offer attractive modes of toxicity
that, when combined with expression only inside a solid tumor, may offer novel therapies for
inhibiting tumor growth. For example, Diphtheria toxin and Pseudomonas Exotoxin A are both two
component toxins (e.g., each has two distinct peptides) that inhibit protein synthesis, resulting
in cell death. The nucleic acid sequences of these toxins could be operably linked to promoters
preferentially activated in solid tumors, and administered to a subject harboring a solid tumor,
with little or no toxicity to the organism outside of the targeted solid tumor.

In some embodiments, multiple nucleic acid reagents can be administered, where each nucleic
acid reagent comprises a nucleic acid sequence for a gene in a metabolic pathway, the pathway
producing a therapeutic agent that can inhibit tumor growth. In certain embodiments, the nucleic
acid reagents can have the same or different heterologous promoters preferentially activated in
tumors operably linked to the sequences for the metabolic pathway genes.

In certain embodiments, the expression systems described herein may generate RNAs or proteins
that are themselves toxic, or RNAs or proteins that are known to have a therapeutic effect by
selective toxicity to solid tumors. A non-limiting example of a protein known to have a therapeutic
effect by selective toxicity to solid tumors is Methioninase, which is known to be selectively
inhibitory to tumors. Additional known toxic proteins include, but are not limited to, ricin,
abrin, and the like. In addition to proteins that are toxic per se, the expression systems may
generate proteins that convert non-toxic compounds into toxic ones. A non-limiting example is the
use of lyases to liberate selenium from selenide analogs of sulfur-containing amino acids. Other
non-limiting examples include generation of enzymes that liberate active compounds from inactive
prodrugs. For example, derivatized forms of palytoxin can be provided that are non-toxic and the
expression system used to produce enzymes that convert the inactive form to the toxic compound.
In addition, proteins that attract systems in the host can also be expressed, including
immunomodulatory proteins such as interleukins.

The subjects that can benefit from the embodiments, methods and compositions described herein
include any subject that harbors a solid tumor in which the promoter operably linked to a
therapeutic agent is preferentially active. Human subjects can be appropriate subjects for
administering the compositions described herein. The methods and compositions described herein
can also be applied to veterinary uses, including livestock such as cows, pigs, sheep, horses,
chickens, ducks and the like. The methods and compositions described herein can also be applied
to companion animals such as dogs and cats, and to laboratory animals such as rabbits, rats,
guinea pigs, and mice.

The tumors to be treated include all forms of solid tumor, including tumors of the breast, ovary,
uterus, prostate, colon, lung, brain, tongue, kidney and the like. Localized forms of highly
metastatic tumors such as melanoma can also be treated in this manner.

Thus, the methods and compositions described herein may provide a selective means for
producing a therapeutic or cytotoxic effect locally in tumor or other target tissue. As the encoded
RNAs or proteins are produced uniquely or preferentially in tumor tissue, side effects due to
expression in normal tissue are minimized.

Nucleic acid molecules may be formulated into pharmaceutical compositions for administration to
subjects. The nucleic acid molecules sometimes are transfected into suitable cells that provide
activating factors for the promoter. In some cases, the tumor cells themselves may contain
workable activators. If the promoter is a bacterial promoter, bacteria, such as Salmonella itself,
may be used. Any cell closely related to that from which the promoter derives is a suitable
candidate. A preferred mode of administration is the use of bacteria that preferentially reside in
hypoxic environments of solid tumors. The compositions which contain the nucleic acids, vectors,
bacteria, cells, etc., sometimes are administered parenterally, such as through intramuscular or
intravenous injection. The compositions can also be directly injected into the solid tumor. Nucleic
acids sometimes are administered in naked form or formulated with a carrier, such as a liposome.

A therapeutic formulation may be administered in any convenient manner, such as by
electroporation, injection, use of a gene gun, use of particles (e.g., gold) and an electromotive
force, or transfection, for example. Compositions may be administered in vivo, ex vivo or in vitro,
in certain embodiments.

As noted above, ancillary substances may also be needed, such as compounds which
activate inducible promoters, substrates on which the encoded protein will act, standard drug
compositions that may complement the activity generated by the expression systems of the
invention, and the like. These ancillary components may be administered in the same composition
as that which contains the expression system or as a separate composition. Administration may
be simultaneous or sequential and may be by the same or different route. Some ancillary agents
may be administered orally or through transdermal or transmucosal administration.


PATENT 

VIV-1001-PC 

The  pharmaceutical  compositions  may  contain  additional  excipients  and  carriers  as  is  known  in  the 
art.  Suitable  diluents  and  carriers  are  found,  for  example,  in  Remington’s  Pharmaceutical 
Sciences,  latest  edition,  Mack  Publishing  Co.,  Easton,  PA,  incorporated  herein  by  reference. 

Examples

The  examples  set  forth  below  illustrate  certain  embodiments  and  do  not  limit  the  invention. 

Example  1:  Materials  and  Methods 

Vector  Construction. 

Promoter trap plasmids with TurboGFP (e.g., a promoter reporter plasmid comprising a destabilized
TurboGFP, World Wide Web URL evrogen.com/TurboGFP.shtml) were generated by PCR from
the pTurboGFP plasmid. The pTurboGFP plasmid was PCR amplified using the primers Turbo-LVA R1
(SEQ ID NO. 1, see Table 1) and Turbo-FI (SEQ ID NO. 2, see Table 1) to generate a
fusion of the peptide motif AANDENYALVA (SEQ ID NO. 3) to the C-terminus of the protein, i.e., at
the 3’ end of the coding sequence (Andersen et al., 1998; Keiler and Sauer, 1996). The PCR product
was digested with EcoRV and self-ligated to generate pTurboGFP-LVA. The plasmids pTurboGFP and
pTurboGFP-LVA were each double digested with XhoI and BamHI to remove the T5 promoter sequence.
The pairs of oligos PRL1-1F / PRL1-1R (SEQ ID NOS. 4 and 5, respectively, see Table 1) and
PRL3-1F / PRL3-1R (SEQ ID NOS. 6 and 7, respectively, see Table 1), containing multi-cloning sites,
transcriptional terminators, and a ribosomal binding site, were used to replace the T5 constitutive
promoter of pTurboGFP and pTurboGFP-LVA, respectively. Primers Turbo-4F and Turbo-1R (SEQ ID
NOS. 8 and 9, respectively, see Table 1) were used to amplify promoter inserts before and after
FACS sort.


Table 1. Sequences of oligonucleotides used to construct promoter trap constructs

Oligo        | SEQ ID NO. | Sequence
Turbo-LVA R1 | 1          | ACTGATATCTTAAGCTACTAAAGCGTAGTTTTCGTCGTTTGCTGCAGGCCTTTCTTCACCGGCATCTGCA
Turbo-FI     | 2          | CTGATATCGCTTGGACTCCTGTTGATAGAT
PRL1-1F      | 4          | TCGAGAGATCTCCATCGAATTCGTGGGTCGACCCCGGGAGGCCTAAAGAGGAGAAATTAACTATGAGAGGATCGG
PRL1-1R      | 5          | GATCCCGATCCTCTCATAGTTAATTTCTCCTCTTTAGGCCTCCCGGGGTCGACCCACGAATTCGATGGAGATCTC
PRL3-1F      | 6          | TCGAGCGAAATTAATACGACTCACTATAGGGAGACCCCCGGGTTAACACTAGTAAAGAGGAGAAATTAACTATGAGAGGATCGG
PRL3-1R      | 7          | GATCCCGATCCTCTCATAGTTAATTTCTCCTCTTTACTAGTGTTAACCCGGGGGTCTCCCTATAGTGAGTCGTATTAATTTCGC
Turbo-4F     | 8          | AAAGTGCCACCTGACGTCT
Turbo-1R     | 9          | CCACCAGCTCGAACTCCAC
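
The oligo pairs in Table 1 are meant to anneal into duplexes whose ends are compatible with XhoI- and BamHI-digested plasmid. As a consistency check on the transcribed sequences, the following minimal Python sketch tests whether PRL1-1F and PRL1-1R are reverse complements once a 4-nt 5' overhang on each strand is set aside. The function names are illustrative only, and the sequences are SEQ ID NOS. 4 and 5 with the extraction spaces removed; the 4-nt overhang length is an assumption based on the sticky ends these enzymes leave.

```python
# Sketch: check that two annealed oligos form a duplex with 4-nt 5' overhangs.
# Function names are illustrative; sequences are SEQ ID NOS. 4 and 5 (Table 1).

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def anneals_with_overhangs(forward: str, reverse: str, overhang: int = 4) -> bool:
    """True if the oligos pair perfectly apart from a 5' overhang on each strand."""
    return reverse_complement(forward[overhang:]) == reverse[overhang:]


PRL1_1F = ("TCGAGAGATCTCCATCGAATTCGTGGGTCGACCCCGGGAGGCCTAAAGAG"
           "GAGAAATTAACTATGAGAGGATCGG")
PRL1_1R = ("GATCCCGATCCTCTCATAGTTAATTTCTCCTCTTTAGGCCTCCCGGGGTCGA"
           "CCCACGAATTCGATGGAGATCTC")

if __name__ == "__main__":
    print("PRL1 pair anneals:", anneals_with_overhangs(PRL1_1F, PRL1_1R))
    # Overhangs expected for XhoI- and BamHI-compatible ends.
    print("5' overhangs:", PRL1_1F[:4], PRL1_1R[:4])
```

The same check could be applied to the PRL3-1F / PRL3-1R pair; a mismatch would point to a transcription error rather than to a design flaw.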

Promoter  Library  Construction. 

10 µg of Salmonella enterica serovar Typhimurium 14028 (S. enterica Typhimurium 14028, ATCC)
genomic DNA was eluted in TE buffer and sonicated with 3 pulses for 5 seconds on ice. Sonicated
DNA was precipitated with 2 volumes of ethanol and 0.1 volumes of Sodium Acetate (100 mM) and
separated on a 1% agarose gel. Fragments of 300 to 500 base pairs (bp) were recovered from the gel
and DNA ends were repaired by T4 DNA polymerase. Repaired fragments were cloned in a
dephosphorylated promoterless GFP plasmid upstream of a StuI and HpaI restriction site in the
stable and destabilized GFP, respectively. These fragments were located just upstream of the
GFP start codon, and were therefore capable of promoting transcription, depending on their
sequence properties. The number of independent clones was approximately 120,000 for the
stable variant and 60,000 for the unstable variant. The two libraries were mixed 1:1 and
designated “library-0”. This library contained about 180,000 independent Typhimurium fragments,
representing about 15-fold coverage of the 4.8 Mb genome with clone spacing averaging every 25
bases. Hybridization to a Salmonella array showed that library-0 represented sequences from
almost the entire genome.
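
The coverage figures quoted for library-0 follow from simple arithmetic, which the short Python sketch below reproduces. The mean insert size of 400 bp is an assumption (the midpoint of the 300 to 500 bp gel cut), and the function names are illustrative.

```python
# Sketch: back-of-the-envelope coverage of a random-fragment library.
# MEAN_INSERT_BP is an assumed value (midpoint of the 300-500 bp size selection).

def fold_coverage(n_clones: int, mean_insert_bp: float, genome_bp: float) -> float:
    """Total cloned bases divided by genome size."""
    return n_clones * mean_insert_bp / genome_bp


def mean_clone_spacing(n_clones: int, genome_bp: float) -> float:
    """Average distance between clone start points if clones were evenly spread."""
    return genome_bp / n_clones


N_CLONES = 180_000        # independent clones in library-0
GENOME_BP = 4_800_000     # approximate genome size used in the text
MEAN_INSERT_BP = 400      # assumption, see above

if __name__ == "__main__":
    print(f"fold coverage: {fold_coverage(N_CLONES, MEAN_INSERT_BP, GENOME_BP):.1f}")
    print(f"mean spacing (bp): {mean_clone_spacing(N_CLONES, GENOME_BP):.1f}")
```

Under these assumptions the sketch gives 15-fold coverage and a mean spacing of roughly 27 bases, in line with the approximate values stated above.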

Array  Design. 

A high-resolution array was generated using Roche NimbleGen high definition array technology
(World Wide Web URL nimblegen.com/products/index.html). The array comprised 387,000 46-mer
to 50-mer oligonucleotides, with length adjusted to generate similar predicted melting temperatures
(Tm). 377,230 of these probes were designed based on the Typhimurium LT2 genome (NC_003197;
McClelland et al., “Complete genome sequence of Salmonella enterica serovar
Typhimurium LT2”, Nature 413:852-856, 2001). Oligonucleotides tiled the genome every 12
bases, on alternating strands. Thus, each base pair in the genome was represented in four to six
oligonucleotides, with two to three oligonucleotides on each strand. Probes representing the three
LT2 regions not present in the genome of the very closely related 14028s strain (phages Fels-1
and Fels-2, STM3255-3260) and greater than 9,000 other oligonucleotides were included as
controls for hybridization performance, synthesis performance, and grid alignment. The
oligonucleotides were distributed in random positions across the array.

Fluorescence  Activated  Cell  Sorting  (FACS)  Analysis. 

Bacteria harboring the constitutive pTurboGFP plasmid were used as a positive control for the
Becton Dickinson FACSAria FACS system. Side scatter SSC-W (X-axis) and SSC-H (Y-axis) were
used to gate on single bacterial cells. GFP-fluorescence (GFP-A) on the X-axis and
auto-fluorescence (PE) on the Y-axis permitted discrimination between green Salmonella cells and
other fluorescent particles of different sizes. Fluorescent particles tended to be distributed on
the diagonal of the GFP-A/PE plot, and had a fluorescence/auto-fluorescence ratio close to 1.
Individual GFP-positive Salmonella cells had a higher ratio of fluorescence/auto-fluorescence and
tended to be distributed close to the X-axis of the GFP-A/PE plot. Putative GFP-positive events in
the window enriched for GFP-expressing Salmonella were sorted at a speed of ~5,000 total events
per second.
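
The gating logic described above amounts to separating events by their ratio of GFP fluorescence to auto-fluorescence. The Python sketch below illustrates that idea only; it is not the instrument's gating software, and the cutoff `min_ratio=3.0` and the function names are hypothetical values chosen for illustration.

```python
# Sketch: ratio-based gate separating GFP-positive cells from auto-fluorescent
# particles. The cutoff (min_ratio) is a hypothetical value, not from the text.
from typing import Iterable, List, Tuple

Event = Tuple[float, float]  # (GFP-A signal, PE auto-fluorescence signal)


def is_gfp_positive(gfp_a: float, pe: float, min_ratio: float = 3.0) -> bool:
    """Events near the diagonal (ratio ~1) are rejected; high-ratio events pass."""
    if pe <= 0:
        return False  # ratio undefined; treat as ungated in this sketch
    return gfp_a / pe >= min_ratio


def gate_events(events: Iterable[Event], min_ratio: float = 3.0) -> List[Event]:
    """Keep only events that fall in the GFP-positive window."""
    return [e for e in events if is_gfp_positive(e[0], e[1], min_ratio)]


if __name__ == "__main__":
    sample = [(900.0, 100.0), (500.0, 480.0), (50.0, 0.0)]
    print(gate_events(sample))
```

In practice the sort window would be drawn on the GFP-A/PE plot using the constitutive pTurboGFP control rather than fixed numerically.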

Example 2: Experimental Results

Enrichment of Active Promoters in Spleen.

To identify active Salmonella promoters in the spleen, five tumor-free nude mice were injected
i.v. with 10^7 colony-forming units (cfu) of Salmonella carrying a promoter library. This library,
designated “library-0”, consisted of ~180,000 plasmid clones, each containing a fragment of the
Salmonella genome upstream of a promoterless GFP gene (described above). Two days after
injection, spleens were combined, homogenized on ice, and treated thrice with PBS containing
0.1% Triton X-100. An aliquot of the final homogenized sample was plated on Luria-Bertani (LB)
medium with 50 µg/mL of ampicillin (Amp) to determine the number of bacterial cfu. The remainder
of the bacteria in the sample was immediately separated by FACS. Fifty thousand potentially
GFP-positive events were sorted and this sublibrary was grown overnight in LB+Amp and designated
“library-1”. The spleen was chosen because it is the primary site of Salmonella accumulation in
normal mice (Ohl and Miller, “Salmonella: a model for bacterial pathogenesis”, Annu. Rev. Med.
52:259-274, 2001).

Enrichment  of  Active  Promoters  in  Tumor. 

The experimental design for tumor samples is described in FIG. 1. Five nude mice bearing human
PC3 prostate tumors, between 0.5 and 1 cm^3 in size, were injected intratumorally with 10^7 cfu of
Salmonella promoter library-0. Two days after injection, tumors were combined, homogenized on
ice and washed, as above. An aliquot was plated to determine the number of bacterial
colony-forming units. The remainder of the sample was immediately separated by FACS. Fifty thousand
GFP-positive events were recovered and grown overnight in LB containing ampicillin (library-2). A
small aliquot of these bacteria was then pelleted, resuspended in PBS (10^6 cfu/mL) and FACS
sorted. GFP-negative events (10^6) were collected, grown in LB overnight, washed in PBS and
reinjected into five human PC3 tumors in nude mice. After 2 days, bacteria were extracted from
tumors and 50,000 GFP-positive events were FACS sorted and expanded in LB+Amp (library-3).
A biological replicate of library-3 was obtained by repeating the experiment from the beginning
using library-0. This was designated library-4.

Genome-wide Survey of Tumor-Activated Promoters Using Arrays.

Plasmid DNA was extracted from the original promoter library (library-0), from clones activated in
spleen (library-1), and from clones activated in subcutaneous PC3 tumors in nude mice after one
(library-2) or two passages (library-3 and library-4) in tumors. Promoter sequences were recovered
by PCR using primers Turbo-4F and Turbo-1R (see Table 1, presented above), and the PCR
product was labeled with Cy5 (library-0) or Cy3 (library-1, library-2, library-3, library-4). The
resulting products were then hybridized to the array of 387,000 oligonucleotide sequences
(described above in Array Design) positioned at 12-base intervals around the Typhimurium
genome (using the manufacturer’s protocol) (Panthel et al., “Prophylactic anti-tumor immunity
against a murine fibrosarcoma triggered by the Salmonella type III secretion system”, Microbes
Infect. 8:2539-2546, 2006). Spot intensities were normalized based on total signal in each
channel. The enrichment of genomic regions was measured by the intensity ratio of the tumor or
the spleen sample versus the input library (library-0). A moving median of the ratio of tumor versus
input library from 10 data points (~170 bases) was calculated across the genome.

The highest median of each intergenic and intragenic region was chosen to represent the most
highly overrepresented region of that promoter or gene in the tested library. Using a threshold of
(exp / control) greater than or equal to 2, and enrichment in both replicates of the experiment
(library-4, plus at least one of library-2 or library-3), there were 86 intergenic regions enriched in
tumors but not in the spleen (see Table 2A and 2B, presented below), and 154 intergenic regions
enriched in both tumor and spleen (see Table 3A and 3B, presented below). There were at least
30 regions enriched in spleen alone (see Table 4, presented below).
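
The array analysis described above (total-signal normalization, a 10-probe moving median, a per-region peak, and a two-fold threshold) can be sketched in a few lines of Python. This is an illustrative reading of the procedure, not the software actually used; the function names are hypothetical, and treating "not enriched in spleen" as a library-1 peak below the same threshold is an assumption.

```python
# Sketch of the enrichment call. Assumptions: control intensities are non-zero,
# and "not in the spleen" means the library-1 peak is below the same threshold.
from statistics import median
from typing import Dict, List, Sequence


def normalized_ratios(experiment: Sequence[float], control: Sequence[float]) -> List[float]:
    """Per-probe experiment/control ratio after scaling each channel by its total signal."""
    exp_total, ctrl_total = sum(experiment), sum(control)
    return [(e / exp_total) / (c / ctrl_total) for e, c in zip(experiment, control)]


def moving_median(values: Sequence[float], window: int = 10) -> List[float]:
    """Median over each run of `window` consecutive probes."""
    if len(values) < window:
        return [median(values)]
    return [median(values[i:i + window]) for i in range(len(values) - window + 1)]


def region_peak(ratios: Sequence[float], window: int = 10) -> float:
    """Highest moving median within one intergenic or intragenic region."""
    return max(moving_median(ratios, window))


def classify_region(peaks: Dict[str, float], threshold: float = 2.0) -> str:
    """Apply the threshold to library-1 (spleen) and libraries 2-4 (tumor)."""
    tumor = peaks["library-4"] >= threshold and (
        peaks["library-2"] >= threshold or peaks["library-3"] >= threshold
    )
    spleen = peaks["library-1"] >= threshold
    if tumor and spleen:
        return "tumor and spleen"
    if tumor:
        return "tumor only"
    if spleen:
        return "spleen only"
    return "not enriched"


if __name__ == "__main__":
    example = {"library-1": 0.7, "library-2": 4.2, "library-3": 6.5, "library-4": 10.3}
    print(classify_region(example))
```

Using a median rather than a mean over the window limits the influence of single aberrant probes on the call for a region.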

Table 3A. Regions that induce GFP expression in both tumor and spleen

Table 4. Intergenic regions that induce higher GFP expression in spleen than in tumor

Sequencing of Promoters.


One hundred and ninety-two clones from a library that underwent two rounds of enrichment in
tumor (library-3) were picked at random and sequenced, yielding 100 different sequences. These
were mapped to the genome and their potential regulation (tumor-specific activation, or activation
in both spleen and tumor) was determined by comparison with the microarray data (see Table 5,
presented below). The clones included 26 that were preferentially activated in tumors, and 40 that
were activated both in tumor and spleen. 77% of the tumor-enriched clones (20 of 26) and 75% of
the clones induced in both tumor and spleen (30 of 40) mapped at least partly to intergenic
regions. As expected, none of these 100 clones were spleen-specific. The 20 intergenic clones
supported by both biological replicates on array experiments are presented in Tables 6A and 6B.


Table 5. Microarray status of active promoter clones in Salmonella

                     | Promoter Status
Genome Location      | Not Detected | Active in Spleen and Tumor | Preferentially Active in Tumor
Intragenic sequences | 27           | 10                         | 6
Intergenic sequences | 7            | 30                         | 20
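
The percentages quoted in the text can be recomputed from the Table 5 counts. The Python sketch below does so; the dictionary layout and function names are illustrative, and the counts are copied from Table 5.

```python
# Sketch: recompute the intergenic fractions quoted in the text from Table 5.
# Data layout and names are illustrative; counts are taken from Table 5.

TABLE_5 = {
    "intragenic": {"not detected": 27, "spleen and tumor": 10, "tumor": 6},
    "intergenic": {"not detected": 7, "spleen and tumor": 30, "tumor": 20},
}


def column_total(status: str) -> int:
    """Number of sequenced clones with the given promoter status."""
    return sum(row[status] for row in TABLE_5.values())


def intergenic_percent(status: str) -> float:
    """Share of clones with this status that map to intergenic regions."""
    return 100 * TABLE_5["intergenic"][status] / column_total(status)


if __name__ == "__main__":
    total = sum(sum(row.values()) for row in TABLE_5.values())
    print("clones tallied:", total)
    for status in ("tumor", "spleen and tumor"):
        print(f"{status}: {intergenic_percent(status):.0f}% intergenic "
              f"of {column_total(status)}")
```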


Table 6A. Cloned candidate intergenic tumor-specific Salmonella promoters

Lib-1 to Lib-4 columns give the median ratio of experiment versus input.

Intergenic region | Genome position of peak signal | Clone ID | Spleen, Lib-1 | Tumor (+), Lib-2 | Tumor (+)(-)(+), Lib-3 | Tumor (+)(-)(+), Lib-4
STM0468 - STM0469 | 526177                         | 85       | 0.9           | 2.3              | 5.5                    | 9.5
STM0474 - STM0475 | 529126                         | 86       | 1.9           | 1.7              | 3.2                    | 2.6
STM0580 - STM0581 | 638735                         | 87       | 0.9           | 3.2              | 0.3                    | 8.5
STM0844 - STM0845 | 914762                         | 10       | 0.8           | 1.9              | 5.8                    | 0.4
STM0937 - STM0938 | 1014704                        | 11       | 0.7           | 4.2              | 6.5                    | 10.3
STM1382 - STM1383 | 1466034                        | 16       | 0.7           | 4.6              | 7.4                    | 13.9
STM1529 - STM1530 | 1606103                        | 20       | 1.9           | 5.5              | 2.8                    | 13
STM1807 - STM1808 | 1909051                        | 26       | 1.2           | 1.6              | 6.5                    | 9.7
STM1914 - STM1915 | 2011503                        | 28       | 0.9           | 3.9              | 7.2                    | 7.5
STM1996 - STM1997 | 2079476                        | 30       | 1.2           | 2.9              | 7.4                    | 4
STM2035 - STM2036 | 2114187                        | 31       | 1.3           | 5.9              | 4.7                    | 8
STM2261 - STM2262 | 2359663                        | 34       | 0.6           | 2.1              | 3.5                    | 4.8
STM2309 - STM2310 | 2417301                        | 36       | 0.6           | 2.7              | 6.5                    | 6.3
STM3070 - STM3071 | 3233025                        | 44       | 0.8           | 1.4              | 2.8                    | 3.1
STM3106 - STM3107 | 3266543                        | 45       | 1.1           | 3.5              | 4.6                    | 4.6
STM3525 - STM3526 | 3688646                        | 55       | 0.8           | 3.8              | 1.8                    | 5.6
STM3880 - STM3881 | 4091492                        | 61       | 0.9           | 5.4              | 0.1                    | 13.8
STM4289 - STM4290 | 4530650                        | 71       | 0.9           | 2                | 8.3                    | 10
STM4418 - STM4419 | 4661108                        | 77       | 0.8           | 3.4              | 8.3                    | 6
STM4430 - STM4431 | 4674477                        | 78       | 1.3           | 6.1              | 5.6                    | 8


Table 6B. Cloned candidate intergenic tumor-specific Salmonella promoters (cont’d)

Intergenic region | Clone ID | Cloned promoter | 5' gene | 5' gene orient | 3' gene | 3' gene orient | Anaerobic induction? | Stable / Unstable GFP
STM0468 - STM0469 | 85       | +               | ylaB    |                | rpmE2   | +              |                      | Unstable
STM0474 - STM0475 | 86       |                 | ybaJ    |                | acrB    |                |                      | Stable
STM0580 - STM0581 | 87       |                 | STM0580 |                | STM0581 | +              |                      | Stable
STM0844 - STM0845 | 10       |                 | pflE    |                | moeB    |                | Yes                  | Unstable
STM0937 - STM0938 | 11       |                 | hcp     |                | ybjE    |                | Yes                  | Unstable
STM1382 - STM1383 | 16       |                 | orf408  |                | ttrA    |                |                      | Stable
STM1529 - STM1530 | 20       |                 | STM1529 | +              | STM1530 | +              |                      | Stable
STM1807 - STM1808 | 26       | +               | dsbB    | +              | STM1808 | +              |                      | Stable
STM1914 - STM1915 | 28       |                 | flhB    |                | cheZ    |                |                      | Unstable
STM1996 - STM1997 | 30       |                 | cspB    |                | umuC    |                |                      | Stable
STM2035 - STM2036 | 31       |                 | cbiA    |                | pocR    |                |                      | Stable
STM2261 - STM2262 | 34       |                 | napF    |                | eco     | +              | Yes                  | Stable
STM2309 - STM2310 | 36       |                 | menD    |                | menF    |                |                      | Stable
STM3070 - STM3071 | 44       |                 | epd     |                | STM3071 | +              |                      | Unstable
STM3106 - STM3107 | 45       |                 | ansB    |                | yggN    |                | Yes                  | Stable
STM3525 - STM3526 | 55       | +               | glpE    | +              | glpD    | +              |                      | Stable
STM3880 - STM3881 | 61       | +               | kup     | +              | rbsD    | +              |                      | Stable
STM4289 - STM4290 | 71       |                 | phnA    |                | proP    | +              |                      | Unstable
STM4418 - STM4419 | 77       | +               | STM4418 |                | STM4419 | +              |                      | Stable
STM4430 - STM4431 | 78       | +               | STM4430 |                | STM4431 | +              |                      | Stable

Some possible tumor promoters mapped inside annotated genes: 23% of the sequenced clones (6
of 26) and 18% of candidates identified by microarray (19 of 105; see Table 7, presented below).
Some “promoters” may be artifacts that could arise from a variety of effects such as the inherent
high copy number of the plasmid clone, or mutations that cause the copy number to increase or a
new promoter to be generated. However, based on data from Escherichia coli, a close relative of
Salmonella, intragenic regions might indeed contain promoters, based on evidence from
transcription start sites, binding sites for RNA polymerase (Reppas et al., “The transition between
transcriptional initiation and elongation in E. coli is highly variable and often rate limiting”,
Mol. Cell 24:747-757, 2006; Grainger et al., “Studies of the distribution of Escherichia coli
cAMP-receptor protein and RNA polymerase along the E. coli chromosome”, Proc. Natl. Acad. Sci. USA
102:17693-17698, 2005), and sigma factors (Wade et al., “Extensive functional overlap between
sigma factors in Escherichia coli”, Nat. Struct. Mol. Biol. 13:806-814, 2006) as well as motif
finders (Tutukina et al., “Intragenic promoter-like sites in the genome of Escherichia coli:
discovery and functional implication”, J. Bioinform. Comput. Biol. 5:549-560, 2007). Further work
may provide confirmatory evidence of promoter activity in some cases.

Some weaker promoters may generate detectable GFP in the stable, but not the destabilized, GFP
plasmid library. Fifty clones sequenced after FACS selection could be assigned to either the
stabilized or destabilized library. Forty of these were of the stable GFP variety versus an expected
25 of each type if there had been no bias. Therefore, the destabilized library is, as expected,
underrepresented following FACS.
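
One way to gauge how unlikely a 40:10 split would be without bias is a one-sided binomial tail probability. The Python sketch below is illustrative only: the choice of test and the function name are assumptions, and the 50:50 null follows the "25 of each type" expectation stated above.

```python
# Sketch: one-sided binomial tail for observing >= 40 stable-GFP clones out of 50
# when both libraries are expected to be equally represented (p = 0.5).
from math import comb


def binomial_upper_tail(successes: int, trials: int, p: float = 0.5) -> float:
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


if __name__ == "__main__":
    p_value = binomial_upper_tail(40, 50)
    print(f"P(>= 40 of 50 stable | no bias) = {p_value:.2e}")
```

Under this null the tail probability is on the order of 10^-5, consistent with the conclusion that the destabilized library is underrepresented after sorting.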

Confirmation of tumor specificity of individual clones in vivo.

Five cloned promoters potentially activated in bacteria growing in tumor but not in the spleen were
selected to be individually confirmed in vivo. Groups of tumor-bearing mice and normal mice were
injected i.v. with bacteria containing the cloned promoters. Tumors and spleens were imaged after
2 days, at low and high resolution, using the Olympus OV100 small animal imaging system. Three
of the five tumor-specific candidates (clones 10, 28, and 45) were induced much more in tumor
than in spleen. Clone 44 produced low signals, and clone 84 was highly expressed in tumor but
was also detectable in the spleen.

Among the most likely promoters to be uncovered in this study are those induced by hypoxia,
which is thought to be an important contributor to Salmonella targeting of tumors (Mengesha et al.,
“Development of a flexible and potent hypoxia-inducible promoter for tumor-targeted gene
expression in attenuated Salmonella”, Cancer Biol. Ther. 5:1120-1128, 2006). Salmonella
promoters induced by hypoxia include those controlled directly or indirectly by the two global
regulators of anaerobic metabolism, Fnr and ArcA (Iuchi and Weiner, “Cellular and molecular
physiology of Escherichia coli in the adaptation to aerobic environments”, J. Biochem. 120:1055-
1063, 1996).

Clone 45 contains the promoter region of ansB, which encodes part of asparaginase. In E. coli,
ansB is positively coregulated by Fnr and by CRP (cyclic AMP receptor protein), a carbon source
utilization regulator (24). In S. enterica, the anaerobic regulation of ansB may require only CRP
(Jennings et al., “Regulation of the ansB gene of Salmonella enterica”, Mol. Microbiol. 9:165-172,
1993; Scott et al., “Transcriptional co-activation at the ansB promoters: involvement of the
activating regions of CRP and FNR when bound in tandem”, Mol. Microbiol. 18:521-531, 1995).

Clone 10 is the promoter region of a putative pyruvate-formate-lyase activating enzyme (pflE).
This clone was only observed in library-3, but enrichment was considerable in that library (see
Tables 2A and 2B). This clone was pursued further because the operon is co-regulated in E. coli
by both ArcA and Fnr (Sawers and Suppmann, “Anaerobic induction of pyruvate formate-lyase
gene expression is mediated by the ArcA and FNR proteins”, J. Bacteriol. 174:3474-3478, 1992;
Knappe and Sawers, “A radical-chemical route to acetyl-CoA: the anaerobically induced pyruvate
formate-lyase system of Escherichia coli”, FEMS Microbiol. Rev. 6:383-398, 1990).

Finally, clone 28 contains the promoter region of flhB, a gene that is required for the formation of
the flagellar apparatus (Williams et al., “Mutations in fliK and flhB affecting flagellar hook and
filament assembly in Salmonella typhimurium”, J. Bacteriol. 178:2960-2970, 1996) and is not known
to be regulated in anaerobic metabolism.

Further screening was performed on these three clones. Bacteria containing these clones were i.v.
injected at 5 x 10^6, 5 x 10^7, and 5 x 10^7 cfu into tumor and non-tumor-bearing nude mice. One or 2
days post-injection, spleens and tumors were imaged using the OV100 imaging system,
homogenized, and the bacterial titer was quantified on LB+Amp. Spleens from normal mice were
compared with tumors that had a similar number of colony-forming units, so that any difference in
fluorescence would be attributable to increased GFP expression rather than bacterial numbers.
FIG. 2 confirms that tumors are much more fluorescent than spleens infected with the same
number of bacteria for each of the three clones. A positive control that constitutively expresses
TurboGFP resulted in strong fluorescence in spleen even with doses as low as 2 x 10^5 cfu.

The Salmonella endogenous promoter for pepT is regulated by CRP and Fnr (Mengesha et al.,
2006). In previous studies, the TATA and the Fnr binding sites of this promoter were modified to
engineer a hypoxia-inducible promoter that drives reporter gene expression under both acute and
chronic hypoxia in vitro (Mengesha et al., 2006). Induction of the engineered hypoxia-inducible
promoter in vivo became detectable in mice 12 hours after death, when the mouse was globally
hypoxic (Mengesha et al., 2006). In our experiments, the wild-type pepT intergenic region did not
pass the threshold to be included in the tumor-specific promoter group. Perhaps the appropriate
clone is not represented in the library, or induction (i.e., the level of hypoxia in the PC3 tumors)
was not sufficient for this particular promoter.

In  summary,  Salmonella  thrives  in  the  hypoxic  conditions  found  in  solid  tumors  (Mengesha  et  al, 
2006).  There  are  four  promoters  known  to  be  regulated  by  hypoxia  among  the  20  sequenced 
intergenic  clones  (see  Tables  2A  and  2B),  of  which  two  (clones  10  and  45)  were  tested  and  shown 
to  be  induced  in  tumors  (see  FIG.  2).  Many  candidate  promoters  that  seem  to  be  preferentially 
30  activated  within  tumors  may  be  unrelated  to  hypoxia,  including  clone  28  (FIG.  2).  Any  promoters 
that  are  later  proven  to  respond  in  their  natural  context  in  the  genome  may  illuminate  conditions 
within  tumors,  other  than  hypoxia,  that  are  sensed  by  Salmonella. 

Attenuated Salmonella strains with tumor targeting ability can be used to deliver therapeutics under
the control of promoters preferentially induced in tumors (Pawelek et al. “Tumor-targeted
Salmonella as a novel anticancer vector”, Cancer Res 1997; 57:4537-44; Zhao et al. “Targeted
therapy with a Salmonella typhimurium leucine-arginine auxotroph cures orthotopic human breast
tumors in nude mice”, Cancer Res 2006; 66:7647-52; Zhao et al. “Tumor-targeting bacterial
therapy with amino acid auxotrophs of GFP-expressing Salmonella typhimurium”, Proc Natl Acad
Sci USA 2005; 102:755-60; Zhao et al. “Monotherapy with a tumor-targeting mutant of Salmonella
typhimurium cures orthotopic metastatic mouse models of human prostate cancer”, Proc Natl
Acad Sci USA 2007; Nishikawa et al. “In vivo antigen delivery by a Salmonella typhimurium type
III secretion system for therapeutic cancer vaccines”, J Clin Invest 2006; 116:1946-54; Panthel et
al. “Prophylactic anti-tumor immunity against a murine fibrosarcoma triggered by the Salmonella
type III secretion system”, Microbes Infect 2006; 8:2539-46; Thamm et al. “Systemic administration
of an attenuated, tumor-targeting Salmonella typhimurium to dogs with spontaneous neoplasia:
phase I evaluation”, Clin Cancer Res 2005; 11:4827-34; Forbes et al. “Sparse initial entrapment of
systemically injected Salmonella typhimurium leads to heterogeneous accumulation within tumors”,
Cancer Res 2003; 63:5188-93; Toso et al. “Phase I study of the intravenous administration of
attenuated Salmonella typhimurium to patients with metastatic melanoma”, J Clin Oncol 2002;
20:142-52; Avogadri et al. “Cancer immunotherapy based on killing of Salmonella-infected tumor
cells”, Cancer Res 2005; 65:3920-7). Such promoters are technically useful whether or not they
are regulated in the same way in their natural context in the genome. These promoters would be
tools to reduce the expression of the therapeutic in bacteria outside the tumor and thus reduce
side-effects, and thereby produce a highly selective and effective therapy of metastatic cancer.
Further refinements are also possible. For example, combinations of two or more promoters
that are preferentially induced in tumors by differing regulatory mechanisms would allow delivery of
two or more separate protein components of a therapeutic system under different regulatory
pathways. In addition, new promoter systems induced by external agents such as arabinose
(Loessner et al. “Remote control of tumor-targeted Salmonella enterica serovar Typhimurium by
the use of L-arabinose as inducer of bacterial gene expression in vivo”, Cell Microbiol. 9:1529-37,
2007) or salicylic acid (Royo et al. “In vivo gene regulation in Salmonella spp. by a
salicylate-dependent control circuit”, Nat. Methods 4:937-42, 2007) allow promoters in Salmonella to be
induced throughout the body at a time of choice. Such inducible regulation could be combined with
tumor-specific Salmonella promoters to express useful products in the tumor only when the
exogenous activator is added; delivery of the therapy would then be precisely controlled in both
time and space.
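
The combined control described above can be pictured as a logical AND of two conditions: the bacterium is in the tumor, and the exogenous activator has been given. The following Python sketch is illustrative only and is not part of the disclosed methods. The names `Context`, `product_expressed` and `truth_table` are hypothetical, and promoter activity is idealized as fully on or off, which real promoters are not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Context:
    """Idealized state of one bacterium (hypothetical model, not from the source)."""
    in_tumor: bool         # tumor-specific promoter condition is met
    inducer_present: bool  # exogenous activator (e.g., arabinose) has been administered


def product_expressed(ctx: Context) -> bool:
    """Return True only when both regulatory conditions hold (AND gate)."""
    return ctx.in_tumor and ctx.inducer_present


def truth_table() -> dict:
    """Enumerate all four combinations of location and inducer status."""
    return {
        (in_tumor, inducer): product_expressed(Context(in_tumor, inducer))
        for in_tumor in (False, True)
        for inducer in (False, True)
    }


if __name__ == "__main__":
    for (in_tumor, inducer), expressed in truth_table().items():
        print(f"in_tumor={in_tumor!s:5} inducer={inducer!s:5} -> expressed={expressed}")
```

Under this simplification, expression occurs in only one of the four cases, which is the sense in which delivery would be controlled in both space (tumor) and time (activator).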

The entirety of each patent, patent application, publication and document referenced herein hereby
is incorporated by reference. Citation of the above patents, patent applications, publications and
documents is not an admission that any of the foregoing is pertinent prior art, nor does it constitute
any admission as to the contents or date of these publications or documents.

Modifications may be made to the foregoing without departing from the basic aspects of the
invention. Although the invention has been described in substantial detail with reference to one or
more specific embodiments, those of ordinary skill in the art will recognize that changes may be
made to the embodiments specifically disclosed in this application, yet these modifications and
improvements are within the scope and spirit of the invention.

The invention illustratively described herein suitably may be practiced in the absence of any
element(s) not specifically disclosed herein. Thus, for example, in each instance herein any of the
terms “comprising,” “consisting essentially of,” and “consisting of” may be replaced with either of
the other two terms. The terms and expressions which have been employed are used as terms of
description and not of limitation, and use of such terms and expressions does not exclude any
equivalents of the features shown and described or portions thereof, and various modifications are
possible within the scope of the invention claimed. The term “a” or “an” can refer to one of or a
plurality of the elements it modifies (e.g., “a reagent” can mean one or more reagents) unless it is
contextually clear either one of the elements or more than one of the elements is described. The
term “about” as used herein refers to a value within 10% of the underlying parameter (i.e., plus or
minus 10%), and use of the term “about” at the beginning of a string of values modifies each of the
values (i.e., “about 1, 2 and 3” refers to about 1, about 2 and about 3). For example, a weight of
“about 100 grams” can include weights between 90 grams and 110 grams. Further, when a listing
of values is described herein (e.g., about 50%, 60%, 70%, 80%, 85% or 86%) the listing includes
all intermediate and fractional values thereof (e.g., 54%, 85.4%). Thus, it should be understood
that although the present invention has been specifically disclosed by representative embodiments
and optional features, modification and variation of the concepts herein disclosed may be resorted
to by those skilled in the art, and such modifications and variations are considered within the scope
of this invention.
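
The arithmetic behind the term “about” can be illustrated with a short Python sketch. It is offered only as an illustration of the plus-or-minus 10% definition given above. The function names `about_range`, `is_about` and `expand_about` are hypothetical, and treating the end points as included is an assumption, since the text does not address them.

```python
def about_range(value: float, tolerance: float = 0.10) -> tuple:
    """Return (low, high) bounds for "about value", i.e. value plus or minus 10%."""
    delta = abs(value) * tolerance
    return (value - delta, value + delta)


def is_about(x: float, value: float, tolerance: float = 0.10) -> bool:
    """True if x lies within the "about" range of value (end points included by assumption)."""
    low, high = about_range(value, tolerance)
    return low <= x <= high


def expand_about(values: list) -> list:
    """Apply "about" to each value in a string of values, as in "about 1, 2 and 3"."""
    return [about_range(v) for v in values]


if __name__ == "__main__":
    print(about_range(100))         # bounds for "about 100 grams"
    print(expand_about([1, 2, 3]))  # "about 1, about 2 and about 3"
```

For “about 100 grams” the computed bounds correspond to the 90 gram and 110 gram limits stated in the text.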

Certain  embodiments  of  the  invention  are  set  forth  in  the  claims  that  follow. 

What is claimed is:

1.  An  isolated  nucleic  acid  molecule  which  comprises  a  recombinant  expression  system, 
which  expression  system  comprises  a  nucleotide  sequence  encoding  a  toxic  or 
therapeutic  RNA  or  protein,  or  an  RNA  or  protein  that  participates  in  generating  a  toxin 
or  therapeutic  agent  operably  linked  to  a  heterologous  promoter  which  promoter  is 
preferentially  activated  in  solid  tumors. 

2.  The  isolated  nucleic  acid  molecule  of  claim  1  wherein  the  promoter  is  an 
Enterobacteriaceae  promoter. 

3.  The  isolated  nucleic  acid  molecule  of  claim  2  wherein  the  promoter  is  a  Salmonella 
promoter. 

4.  The  isolated  nucleic  acid  molecule  of  claim  3,  wherein  the  promoter  comprises  (i)  a 
nucleotide  sequence  of  Table  7A  and  Table  7B,  or  (ii)  a  functional  promoter 
subsequence  of  (i). 

5.  The  isolated  nucleic  acid  molecule  of  claim  4,  wherein  the  functional  promoter 
subsequence  is  about  20  to  about  150  nucleotides  in  length. 

6.  Recombinant  host  cells  that  contain  the  nucleic  acid  molecule  of  any  of  claims  1-5. 

7.  The  cells  of  claim  6  that  are  avirulent  Salmonella. 

8. A pharmaceutical composition which comprises the nucleic acid molecule of claims 1-5
or the cells of claims 6-7.

9.  A  method  to  treat  solid  tumors  which  method  comprises  administering  to  a  subject 
harboring  said  tumors  the  nucleic  acid  molecule  of  claims  1-5  or  the  cells  of  claims  6-7 
or  the  composition  of  claim  8. 

10.  A  method  for  identifying  a  promoter  preferentially  activated  in  tumor  tissue  which 
method  comprises: 

(a) providing a library of expression systems each comprising a nucleotide
sequence  encoding  a  detectable  protein  operably  linked  to  a  different  candidate 
promoter; 

(b)  providing  said  library  to  solid  tumor  tissue  and  to  normal  tissue; 

(c)  identifying  cells  from  each  tissue  that  show  high  levels  of  expression  of  the 
detectable  protein;  and 

(d)  obtaining  said  expression  systems  from  the  cells  that  produce  greater  levels 
of  detectable  protein  in  tumor  tissue  as  compared  to  normal  tissue,  and  identifying  the 
promoters  of  said  expression  system. 

11. The method of claim 10 wherein said library is provided in recombinant host cells.

12.  The  method  of  claim  10  or  claim  11  wherein  the  promoters  are  Salmonella  promoters 
and  the  recombinant  host  cells  are  Salmonella. 

13.  The  method  of  any  one  of  claims  10-12,  wherein  the  candidate  promoters  are  from 
bacteria,  or  are  80%  or  more  identical  to  promoters  from  bacteria. 

14.  The  method  of  claim  13,  wherein  the  bacteria  are  Enterobacteriaceae. 

15.  The  method  of  claim  14,  wherein  the  Enterobacteriaceae  are  Salmonella. 

16.  The  method  of  any  one  of  claims  10-15,  which  comprises  scoring  promoters 
identified  in  (d). 

17.  An  expression  system  which  comprises  a  nucleotide  sequence  encoding  a  toxic  or 
therapeutic  protein  or  a  protein  that  participates  in  generating  a  desired  toxin  or 
therapeutic  agent  operably  linked  to  a  promoter  identified  by  the  method  of  any  of  claims 
10-16. 


18.  Recombinant  host  cells  that  comprise  the  expression  system  of  claim  17. 

19. A method to treat solid tumors which method comprises administering an expression
system of claim 17 or the cells of claim 18 to a subject harboring a solid tumor.

20. The method of claim 19, wherein the protein encoded by the nucleotide sequence
comprises  enzymic  activity. 

21. The method of claim 20, which comprises administering a prodrug to the subject
that does not inhibit tumors, wherein the protein encoded by the nucleotide sequence
converts the prodrug to a drug that inhibits tumors.

22.  An  expression  system  which  comprises  a  first  promoter  nucleotide  sequence 
operably  linked  to  a  first  coding  sequence  and  second  promoter  nucleotide  sequence 
operably  linked  to  a  second  coding  sequence,  wherein: 

the  first  coding  sequence  and  the  second  coding  sequence  encode  polypeptides 
that  individually  do  not  inhibit  tumor  growth; 

polypeptides  encoded  by  the  first  coding  sequence  and  the  second  coding 
sequence,  in  combination,  inhibit  tumor  growth;  and 

the  first  promoter  nucleotide  sequence  and  the  second  promoter  nucleotide 
sequence  are  preferentially  activated  in  solid  tumors. 

23.  The  expression  system  of  claim  22,  wherein  the  first  promoter  nucleotide  sequence 
and  the  second  promoter  nucleotide  sequence  are  in  the  same  nucleic  acid  molecule. 

24.  The  expression  system  of  claim  22,  wherein  the  first  promoter  nucleotide  sequence 
and  the  second  promoter  nucleotide  sequence  are  in  different  nucleic  acid  molecules. 

25.  The  expression  system  of  any  one  of  claims  22-24,  wherein  the  first  promoter 
nucleotide  sequence  and  the  second  promoter  nucleotide  sequence  are  bacterial 
nucleotide  sequences. 

26.  The  expression  system  of  claim  25,  wherein  the  bacterial  sequences  are 
Enterobacteriaceae  sequences. 

27.  The  expression  system  of  claim  26,  wherein  the  Enterobacteriaceae  sequences  are 
Salmonella  sequences. 

28. The expression system of any one of claims 22-27, wherein:

the  first  coding  sequence  encodes  an  enzyme, 

the  second  coding  sequence  encodes  a  prodrug,  and 

the  enzyme  processes  the  prodrug  into  a  drug  that  inhibits  tumor  growth. 

29.  The  expression  system  of  any  one  of  claims  22-27,  wherein: 

the  first  coding  sequence  encodes  a  first  polypeptide, 
the  second  coding  sequence  encodes  a  second  polypeptide,  and 
the  first  polypeptide  and  the  second  polypeptide  form  a  complex  that  inhibits 
tumor  growth. 

30.  The  expression  system  of  any  one  of  claims  22-30,  wherein  the  first  promoter 
nucleotide  sequence,  the  second  promoter  nucleotide  sequence,  or  the  first  promoter 
nucleotide  sequence  and  the  second  promoter  nucleotide  sequence  comprise  (i)  a 
nucleotide sequence of Table 7A and Table 7B, (ii) a functional promoter nucleotide
sequence 80% or more identical to a nucleotide sequence of Table 7A and Table 7B, or
(iii) a functional promoter subsequence of (i) or (ii).

31. The expression system of claim 30, wherein the functional promoter subsequence is
about  20  to  about  150  nucleotides  in  length. 

32. Recombinant host cells that contain the expression system of any one of claims 22-31.

33.  The  cells  of  claim  32  that  are  avirulent  Salmonella. 

34.  An  expression  system  which  comprises  three  or  more  heterologous  promoter 
nucleotide  sequences  operably  linked  to  three  or  more  coding  sequences,  wherein  the 
promoter  nucleotide  sequences  are  preferentially  activated  in  solid  tumors. 

35.  The  expression  system  of  claim  34,  wherein  the  coding  sequences  encode 
polypeptides  that  individually  do  not  inhibit  tumor  growth,  and  the  polypeptides  encoded 
by  the  coding  sequences,  in  combination,  inhibit  tumor  growth. 

36. The expression system of claim 34 or 35, wherein the promoter nucleotide
sequences  are  in  the  same  nucleic  acid  molecule. 

37.  The  expression  system  of  claim  34  or  35,  wherein  the  promoter  nucleotide 
sequences  are  in  different  nucleic  acid  molecules. 

38. The expression system of any one of claims 34-37, wherein the promoter nucleotide
sequences are bacterial nucleotide sequences.

39.  The  expression  system  of  claim  38,  wherein  the  bacterial  sequences  are 
Enterobacteriaceae  sequences. 

40.  The  expression  system  of  claim  39,  wherein  the  Enterobacteriaceae  sequences  are 
Salmonella  sequences. 

41. The expression system of any one of claims 34-40, wherein the first promoter
nucleotide  sequence,  the  second  promoter  nucleotide  sequence,  or  the  first  promoter 
nucleotide  sequence  and  the  second  promoter  nucleotide  sequence  comprise  (i)  a 
nucleotide  sequence  of  Table  7A  and  Table  7B,  (ii)  a  functional  promoter  nucleotide 
sequence  80%  or  more  identical  to  a  nucleotide  sequence  of  Table  7A  and  Table  7B,  or 
(iii) a functional promoter subsequence of (i) or (ii).

42. The expression system of claim 41, wherein the functional promoter subsequence is
about  20  to  about  150  nucleotides  in  length. 

43. Recombinant host cells that contain the expression system of any one of claims 34-42.

44.  The  cells  of  claim  43  that  are  avirulent  Salmonella. 

Abstract


METHODS  TO  TREAT  SOLID  TUMORS 


A high-throughput method for identifying promoters differentially activated in solid tumors
as compared to normal tissues is described. The promoters so identified may be used
to drive production of RNAs or proteins useful in treating solid tumors, including toxic
RNAs or proteins and other therapeutic RNAs or proteins.
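
The comparison at the heart of such a screen, namely selecting clones whose reporter signal is greater in tumor tissue than in normal tissue, can be sketched in Python as follows. This is a minimal illustration under stated assumptions and not the procedure actually used. The function name `tumor_preferential`, the `min_ratio` cutoff, the `pseudocount` and the input layout are all hypothetical.

```python
def tumor_preferential(clones: dict, min_ratio: float = 2.0, pseudocount: float = 1.0) -> list:
    """Rank clones by tumor-to-normal reporter signal ratio.

    clones maps a clone identifier to a (tumor_signal, normal_signal) pair.
    The pseudocount (an assumption) avoids division by zero for silent clones.
    Returns (clone_id, ratio) pairs with ratio >= min_ratio, highest first.
    """
    selected = []
    for clone_id, (tumor, normal) in clones.items():
        ratio = (tumor + pseudocount) / (normal + pseudocount)
        if ratio >= min_ratio:
            selected.append((clone_id, ratio))
    return sorted(selected, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Made-up signals for illustration only; not data from the source.
    example = {"clone_a": (99.0, 9.0), "clone_b": (10.0, 10.0), "clone_c": (3.0, 1.0)}
    for clone_id, ratio in tumor_preferential(example):
        print(clone_id, ratio)
```

A ratio-based score is only one possible choice; any scoring that reflects greater expression in tumor than in normal tissue would serve the same illustrative purpose.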

[Figure labels: intratumoral injection of Salmonella promoter library (Library-Q); bacteria recovered from the tumor]